## How to parse text data

## Example

Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2).

The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, [17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural.

## NLTK

In [1]:
import nltk

# split sentences
from nltk import sent_tokenize
nltk.download('punkt')

# split into words
from nltk.tokenize import word_tokenize

# remove stop words
from nltk.corpus import stopwords
nltk.download('stopwords')

# stemming of words
from nltk.stem.porter import PorterStemmer


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
text = "Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2). \
The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, \
[17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural."

print(text)

# split sentences
sentences = sent_tokenize(text)
print(sentences)

# split into words
words = word_tokenize(text)
print(words)

# remove stop words (the NLTK list is lowercase, so compare lowercased tokens)
stop_words = set(stopwords.words('english'))
words = [w for w in words if w.lower() not in stop_words]
print(words)

# stemming words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]

print(stemmed)

Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2). The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, [17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural.
['Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2).', "The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, [17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural."]
['Texas', 'is', 'the', 'second-largest', 'U.S.', 'state', ',', 'after', 'Alaska', ',', 'with', 'an', 'area', 'of', '268,820', 'square', 'miles', '(', '696,200', 'km2', ')', '.', 'The', 'name', 'Texas', ',', 'based', 'on', 'the', 'Caddo', 'word', 'táyshaʼ', '(', "/t'ajʃaʔ/", ')',
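
NLTK's English stop-word list is stored in lowercase, so a case-sensitive membership test lets capitalised tokens such as `The` through. The following is a minimal sketch of that difference. It uses a tiny hand-written stop list (`STOP_WORDS` is an illustrative stand-in, not the NLTK corpus) and a hand-picked token list, so it runs without any downloads.

```python
# Tiny illustrative stop list; the real NLTK list is much longer.
STOP_WORDS = {"the", "is", "on", "of", "an", "with"}

tokens = ["The", "name", "Texas", "is", "based", "on", "the", "Caddo", "word"]

# Case-sensitive test: "The" is not in the lowercase list, so it is kept.
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Lowercasing each token before the lookup removes it as well.
case_folded = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_folded)
```

Whether capitalised stop words should be dropped depends on the task; for a bag-of-words model they usually carry no more signal than their lowercase forms.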

## spaCy

In [3]:
text = "Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2). \
The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, \
[17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural."

print(text)

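import spacy
# the 'en' shortcut was removed in spaCy 3; load the model by its full name
# (assuming the small English model, installed with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')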

document = nlp(text)

# split sentences
print(list(document.sents))

# split into words
print(list(document))

# remove stop words
print([word for word in document if not word.is_stop])

# lemmatization
lemmas = [token.lemma_ for token in document if not token.is_stop]
print(lemmas)

Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2). The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, [17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural.
[Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2)., The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, [17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural.]
[Texas, is, the,    , second, -, largest, U.S., state, ,, after, Alaska, ,, with, an, area, of, 268,820, square, miles, (, 696,200, km2, ), ., The, name, Texas, ,, based, on, the, Caddo, word, táyshaʼ, (, /t'ajʃaʔ/, ), ', friend, ', ,, was, applied, ,, in, the, spelling, Tejas, or, 
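
Both libraries above keep `U.S.` inside the first sentence. For comparison, here is a rough sketch of what a naive rule (split wherever a period is followed by whitespace) would do. The `text` below is a shortened, simplified version of the example, and the regular expression is only an illustration, not how either library works internally.

```python
import re

# Shortened version of the example text, for illustration only.
text = ("Texas is the second-largest U.S. state, after Alaska. "
        "The name Texas is based on the Caddo word for 'friend'.")

# Naive rule: split wherever a period is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule breaks the first sentence in two right after `U.S.`, which is the kind of abbreviation handling a trained sentence tokenizer is meant to take care of.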

## Cleaning

In [4]:
import re
from unicodedata import normalize

text = "Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2). \
The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, \
[17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural."

print(text)

# normalize unicode: decompose accented characters, then drop anything that is not ASCII
text = normalize('NFD', text).encode('ascii', 'ignore')
text = text.decode('ascii')
print(text)

# remove punctuation
number_handler = re.compile(r'(?<=\d),(?=\d)')
punct_re = re.compile('[{}]'.format(re.escape('!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~-')))
abbreviation = re.compile(r'[^a-zA-Z0-9_.-]')

# join thousands separators (268,820 -> 268820) while the commas are still present
text = number_handler.sub('', text)

# replace everything except letters, digits, '-', '_' and '.' with a space
text = abbreviation.sub(' ', text)
print(text)

# replace the remaining punctuation (including '.' and '-') with a space
text = punct_re.sub(' ', text)
print(text)

# remove any double whitespace
text = ' '.join(text.split())
print(text)

Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2). The name Texas, based on the Caddo word táyshaʼ (/t'ajʃaʔ/) 'friend', was applied, in the spelling Tejas or Texas, [17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural.
Texas is the    second-largest U.S. state, after Alaska, with an area of 268,820 square miles (696,200 km2). The name Texas, based on the Caddo word taysha (/t'aja/) 'friend', was applied, in the spelling Tejas or Texas, [17] by the Spanish to the Caddo themselves, specifically the Hasinai Confederacy,[18] the final -s representing the Spanish plural.
Texas is the    second-largest U.S. state  after Alaska  with an area of 268820 square miles  696200 km2 . The name Texas  based on the Caddo word taysha   t aja    friend   was applied  in the spelling Tejas or Texas   17  by the Spanish to the Caddo themselves  specifically the Hasina
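
The Unicode step is easier to follow one character at a time. The sketch below applies the same `normalize('NFD', ...)` and ASCII-encoding calls used above to the single word `táysha` and to the IPA letter `ʃ`. The variable names are illustrative only.

```python
from unicodedata import normalize, name

word = "t\u00e1ysha"  # "táysha" with a precomposed a-acute

# NFD splits the accented letter into a base letter plus a combining accent.
decomposed = normalize("NFD", word)
for ch in decomposed:
    print(repr(ch), name(ch))

# Encoding to ASCII with errors="ignore" drops every non-ASCII code point,
# including the combining accent, and leaves the base letters behind.
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_only)

# A character with no ASCII base letter, such as the IPA "esh", simply disappears.
esh_ascii = normalize("NFD", "\u0283").encode("ascii", "ignore").decode("ascii")
print(repr(esh_ascii))
```

This is why the accented vowel survives as a plain `a` while the IPA symbols in the pronunciation are lost entirely.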

In [0]:
""" CLEAN TEXT FUNCTION """

import re
from unicodedata import normalize

def clean_text(text):
    # normalize unicode: decompose accented characters, then drop anything that is not ASCII
    clean = normalize('NFD', text).encode('ascii', 'ignore')
    clean = clean.decode('ascii')

    # remove punctuation
    number_handler = re.compile(r'(?<=\d),(?=\d)')
    punct_re = re.compile('[{}]'.format(re.escape('!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~-')))
    abbreviation = re.compile(r'[^a-zA-Z0-9_.-]')

    # join thousands separators first, while the commas are still present
    clean = number_handler.sub('', clean)
    clean = abbreviation.sub(' ', clean)
    clean = punct_re.sub(' ', clean)

    # remove any double whitespace
    clean = ' '.join(clean.split())

    return clean


In [6]:
clean_text(text)

'Texas is the second largest U S state after Alaska with an area of 268820 square miles 696200 km2 The name Texas based on the Caddo word taysha t aja friend was applied in the spelling Tejas or Texas 17 by the Spanish to the Caddo themselves specifically the Hasinai Confederacy 18 the final s representing the Spanish plural'
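
The `(?<=\d),(?=\d)` pattern uses a lookbehind and a lookahead, so it matches a comma only when a digit sits directly on both sides of it. A small sketch on a made-up sample string (the names `thousands_comma` and `sample` are illustrative) shows what it does and does not touch.

```python
import re

# Comma with a digit directly before and directly after it.
thousands_comma = re.compile(r"(?<=\d),(?=\d)")

sample = "an area of 268,820 square miles, after Alaska"
joined = thousands_comma.sub("", sample)
print(joined)  # an area of 268820 square miles, after Alaska

# If the commas have already been turned into spaces, nothing is left to match.
already_spaced = sample.replace(",", " ")
unchanged = thousands_comma.sub("", already_spaced) == already_spaced
print(unchanged)
```

Because the pattern depends on the comma still being present, it has an effect only if it runs before any step that replaces punctuation with spaces.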