# Linguistic Feature Exploration

## Syntax Analysis

In this example, spaCy is used to perform syntax analysis on a sentence. Syntax refers to the arrangement of words in a sentence to make grammatical sense.


This script processes a sentence and prints the attributes of each token (a word or punctuation mark). These attributes include the lemma (base form of the word), the coarse part-of-speech (POS) tag, the detailed part-of-speech tag, the syntactic dependency (how the word relates to other words in the sentence), the shape of the word, whether it consists only of alphabetic characters, and whether it is considered a stopword.




In [1]:
import spacy

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

In [3]:
# Process a sentence
sentence = "The quick brown fox runs faster the lazy dog."
doc = nlp(sentence)

# Token attributes for syntax analysis
print("Text\tLemma\tPOS\tTag\tDep\tShape\tis_alpha\tis_stop")
for token in doc:
    print(f"{token.text}\t{token.lemma_}\t{token.pos_}\t{token.tag_}\t{token.dep_}\t{token.shape_}\t{token.is_alpha}\t{token.is_stop}")

Text	Lemma	POS	Tag	Dep	Shape	is_alpha	is_stop
The	the	DET	DT	det	Xxx	True	True
quick	quick	ADJ	JJ	amod	xxxx	True	False
brown	brown	ADJ	JJ	amod	xxxx	True	False
fox	fox	NOUN	NN	nsubj	xxx	True	False
runs	run	VERB	VBZ	ROOT	xxxx	True	False
faster	fast	ADV	RBR	advmod	xxxx	True	False
the	the	DET	DT	det	xxx	True	True
lazy	lazy	ADJ	JJ	amod	xxxx	True	False
dog	dog	NOUN	NN	dobj	xxx	True	False
.	.	PUNCT	.	punct	.	False	False
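The Shape column above can be approximated in plain Python. The sketch below is not spaCy's implementation; `word_shape` is a hypothetical helper that assumes the shape maps uppercase letters to `X`, lowercase letters to `x`, digits to `d`, keeps other characters unchanged, and truncates runs of the same shape character after four.

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style shape string for a single token (hypothetical helper)."""
    shape = []
    last_char = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last_char:
            run_length += 1
        else:
            run_length = 1
            last_char = shape_char
        # Keep at most four identical shape characters in a row
        if run_length <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["The", "quick", "fox", "NLP2024", "."]:
    # str.isalpha() plays the role of the is_alpha column
    print(word, word_shape(word), word.isalpha(), sep="\t")
```

Under these assumptions, "quick" and "faster" both collapse to `xxxx`, which matches the table above.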


## Semantic Similarity

After processing two sentences with spaCy, the script calculates a similarity score that estimates how close the two sentences are in meaning. The score usually falls between 0 and 1, where values near 1 indicate very similar text and identical text scores 1.

In [11]:
doc1 = nlp("Burgers and Fries are not my favorties.")
doc2 = nlp("Burgers and Fries are my favorties.")

In [12]:
doc1.similarity(doc2)


0.9101511475178125

In [None]:
# Process some text


# Get the similarity between two docs
similarity = doc1.similarity(doc2)
print(f"Similarity: {similarity}")

Similarity: 0.2030433545638997
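By default, spaCy's `Doc.similarity` is documented as a cosine similarity over averaged word vectors. The sketch below illustrates that idea in plain Python. The three-dimensional vectors are made up for illustration, and `cosine_similarity` and `mean_vector` are hypothetical helpers, not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy "word vectors" (made-up numbers, not taken from any model)
toy_vectors = {
    "burgers": [0.9, 0.1, 0.0],
    "fries": [0.8, 0.2, 0.1],
    "salad": [0.1, 0.9, 0.2],
}

doc_a = mean_vector([toy_vectors["burgers"], toy_vectors["fries"]])
doc_b = mean_vector([toy_vectors["fries"], toy_vectors["salad"]])

print(cosine_similarity(doc_a, doc_a))  # identical vectors
print(cosine_similarity(doc_a, doc_b))  # partially overlapping "documents"
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite directions
```

Cosine similarity can in principle be negative, as the last call shows. Averaging word vectors also ignores word order and negation, which is one reason the "not my favorites" pair above still scores high.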



## Structure Analysis with TextBlob

TextBlob is used here to parse a paragraph and extract its noun phrases. Noun phrases give insight into the key subjects and topics of the text, which makes them a useful structural component of language understanding.

In [None]:
import nltk

# Download every NLTK package (a large download; TextBlob only needs a subset)
nltk.download('all')

In [None]:
from textblob import TextBlob

# Example paragraph
paragraph = """
Natural Language Processing enables
the computer to understand human language.
It is a field of study focused on making
sense of language and getting the computer
to perform useful tasks utilizing the
language data.
"""

# Create a TextBlob object
blob = TextBlob(paragraph)

# Noun phrase extraction for structure analysis
print("Noun Phrases:")
for phrase in blob.noun_phrases:
    print(phrase)

Noun Phrases:
language processing
human language
useful tasks
language data
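To give a feel for what noun phrase extraction does, the sketch below groups runs of adjectives and nouns from already-tagged tokens. It is a deliberately simplified illustration and not TextBlob's algorithm. `extract_noun_phrases` is a hypothetical helper, and the tags are written by hand rather than produced by a tagger.

```python
def extract_noun_phrases(tagged_tokens):
    """Group runs of adjectives/nouns that end in a noun into phrases."""
    phrases = []
    current = []  # (word, tag) pairs in the current candidate run

    def flush():
        # Drop trailing adjectives so each phrase ends in a noun
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        # Keep only multi-word phrases
        if len(current) >= 2:
            phrases.append(" ".join(word.lower() for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases


# Hand-tagged tokens (an assumption; a real pipeline would use a POS tagger)
tagged = [
    ("It", "PRP"), ("performs", "VBZ"), ("useful", "JJ"), ("tasks", "NNS"),
    ("with", "IN"), ("the", "DT"), ("language", "NN"), ("data", "NNS"),
    (".", "."),
]

print(extract_noun_phrases(tagged))  # ['useful tasks', 'language data']
```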


# Text Preprocessing

## Stopword removal

This script uses the NLTK library to filter out stopwords from a sample sentence. It first tokenizes the sentence into words and then removes words that are present in the predefined list of English stopwords.

In [34]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

text = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)

filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
filtered_text = " ".join(filtered_text)

print(filtered_text)


sample sentence , showing stop words filtration .


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
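The filtered sentence above still contains the punctuation tokens "," and ".". The sketch below shows the same case-insensitive filter in plain Python, with an optional switch that also drops punctuation-only tokens. `TOY_STOP_WORDS` is a tiny hand-picked set used only for illustration (NLTK's English list is much longer), and `remove_stopwords` is a hypothetical helper.

```python
import string

# A tiny hand-picked stopword set (an assumption for illustration)
TOY_STOP_WORDS = {"this", "is", "a", "the", "off"}


def remove_stopwords(tokens, stop_words, drop_punctuation=False):
    """Return the tokens that are not stopwords (compared in lowercase)."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        # Optionally skip tokens made up entirely of punctuation characters
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ["This", "is", "a", "sample", "sentence", ",", "showing", "off",
          "the", "stop", "words", "filtration", "."]

print(" ".join(remove_stopwords(tokens, TOY_STOP_WORDS)))
print(" ".join(remove_stopwords(tokens, TOY_STOP_WORDS, drop_punctuation=True)))
```

Lowercasing each token before the lookup matters because stopword lists are usually stored in lowercase, so "This" would otherwise slip through.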


In [33]:
# Remove 'a' from the stopword set so that it is no longer filtered out
stop_words.discard('a')

## Punctuation Removal using String Library

This code removes punctuation using Python's built-in string module. str.maketrans builds a translation table that maps every character in string.punctuation to None, and translate deletes each character mapped to None, which removes the punctuation from the text.



In [None]:
import string

text = "This is a sample sentence, with some punctuation!"
clean_text = text.translate(str.maketrans('', '', string.punctuation))

print(clean_text)


This is a sample sentence with some punctuation
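The sketch below looks at the translation table itself and shows one possible variation: keeping selected punctuation such as apostrophes and hyphens. `strip_punctuation` and its `keep` parameter are hypothetical names introduced for this illustration.

```python
import string

# The third argument lists characters to delete; each is mapped to None
table = str.maketrans("", "", ",!")
print(table)  # maps code points 44 (',') and 33 ('!') to None


def strip_punctuation(text, keep=""):
    """Remove ASCII punctuation from text, except characters listed in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_remove))


sample = "Don't stop, it's well-known!"
print(strip_punctuation(sample))             # Dont stop its wellknown
print(strip_punctuation(sample, keep="'-"))  # Don't stop it's well-known
```

Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotes would pass through unchanged.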


## POS tagging with NLTK

This code tokenizes a sentence into words and then uses NLTK's pos_tag function to assign part-of-speech tags to each token.



In [23]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')

text = "This is a dog"
word_tokens = word_tokenize(text)

pos_tags = nltk.pos_tag(word_tokens)

print(pos_tags)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('dog', 'NN')]
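The tags in this output follow the Penn Treebank tag set: DT is a determiner, VBZ a third-person singular present verb, and NN a singular noun. As a small sketch of how such output might be used downstream, the code below counts tags and picks out nouns and verbs by tag prefix. The list is copied from the output above so the example runs without NLTK.

```python
from collections import Counter

# Tagged tokens copied from the output above
pos_tags = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("dog", "NN")]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in pos_tags)

# Penn Treebank noun tags start with "NN", verb tags with "VB"
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
verbs = [word for word, tag in pos_tags if tag.startswith("VB")]

print(tag_counts)
print("Nouns:", nouns)
print("Verbs:", verbs)
```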


## Tokenization using NLTK

This script uses NLTK's word_tokenize function to split the text into a list of words and punctuation marks. This process is known as tokenization, and it is usually one of the first steps in text preprocessing for NLP.

In [None]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

text = "Natural language processing is a field of computer science, AI"
tokens = word_tokenize(text)

print(tokens)


['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'AI']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
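For comparison, a very rough tokenizer can be written with a single regular expression. This is a simplified sketch and not how word_tokenize works internally. `simple_tokenize` is a hypothetical helper that treats runs of word characters as tokens and every other non-space character as its own token.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "Natural language processing is a field of computer science, AI"
print(simple_tokenize(text))

# Contractions show where this naive approach falls short
print(simple_tokenize("don't"))  # ['don', "'", 't']
```

On the sentence above this happens to produce the same tokens as the NLTK output, but NLTK handles contractions and other edge cases with more care.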


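## Stemming and Lemmatization using NLTK

This code uses NLTK's PorterStemmer to reduce words to their stems and WordNetLemmatizer to reduce words to their dictionary base forms (lemmas).

In [26]: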
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')
nltk.download('wordnet')


ps = PorterStemmer()
ps


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<PorterStemmer>

In [31]:
# Stemming

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
print('Stemming word Example')
print(example_words)
print('\n')
print('After Stemming, Words are')
# Next, we can easily stem by doing something like:
for w in example_words:
    print(ps.stem(w))

Stemming word Example
['python', 'pythoner', 'pythoning', 'pythoned', 'pythonly']


After Stemming, Words are
python
python
python
python
pythonli
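Stemming works by applying suffix rules rather than looking words up in a dictionary, which is why the result can be a non-word such as "pythonli". The sketch below is a toy suffix stripper that illustrates the rule-based idea. It is far simpler than the Porter algorithm and will not reproduce its output in general. `SUFFIXES` and `toy_stem` are hypothetical names.

```python
# Suffixes tried in order; the first match is removed (an illustrative choice)
SUFFIXES = ("ing", "ed", "er", "ly", "s")


def toy_stem(word, min_stem_length=3):
    """Strip at most one known suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for w in ["python", "pythoner", "pythoning", "pythoned", "pythonly"]:
    print(w, "->", toy_stem(w))
```

Because this toy version removes only one suffix, "pythoners" becomes "pythoner" rather than "python". Real stemmers apply several rule passes to handle such cases.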


In [29]:
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
print('Another Example')
print(new_text)
print('\n')

print('For each word, stemming done as follows')
# Word Tokenizer
words = word_tokenize(new_text)
# For each word, stemming done
for w in words:
    print(ps.stem(w))

Another Example
It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once.


For each word, stemming done as follows
it
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


In [30]:

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
run
run
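Unlike stemming, lemmatization relies on a vocabulary lookup, and the part of speech affects the result. WordNetLemmatizer treats a word as a noun unless a pos argument is given, which is why "better" needs pos="a" to become "good". The sketch below mimics that behavior with a tiny lookup table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical, and the entries are hand-written rather than taken from WordNet.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
TOY_LEMMAS = {
    ("geese", "n"): "goose",
    ("rocks", "n"): "rock",
    ("better", "a"): "good",
    ("ran", "v"): "run",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged if there is no entry."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("geese"))        # goose
print(toy_lemmatize("better"))       # better (looked up as a noun)
print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("ran", "v"))     # run
```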
