# ***Text Extraction***

Text extraction is another widely used text analysis technique. It pulls out pieces of data that are already present in a text, such as keywords or named entities.

# Keyword Extraction

Keywords are the most frequent and most relevant terms in a text: the words and phrases that summarize its contents.
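
As a rough illustration of the idea, the sketch below ranks terms by how often they occur after dropping common function words. The `STOPWORDS` set, the `extract_keywords` helper, and the sample sentence are all assumptions made up for this example; real keyword extractors usually weigh relevance as well as raw frequency.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only)
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "than"}

def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword terms in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase word tokens
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

# Made-up sample text, not taken from the dataset
sample = "Real trees and fake trees are sold in December. Real trees cost more than fake trees."
print(extract_keywords(sample))
```

Frequency alone is a crude proxy for relevance, which is why the stopword filter matters: without it, words such as "and" or "the" would dominate the ranking.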

# Entity Recognition

A named entity recognition (NER) extractor finds the entities mentioned in text data, such as people, companies, or locations.
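
The simplest form of entity recognition is a dictionary (gazetteer) lookup, sketched below. This is a stand-in for a trained NER model, not a replacement for one: it only finds names it already knows. The `GAZETTEER` entries, the `find_entities` helper, and the sample sentence are hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> entity type (illustration only)
GAZETTEER = {
    "Oregon": "LOCATION",
    "North Carolina": "LOCATION",
    "Acme Tree Farms": "COMPANY",
    "Jane Doe": "PERSON",
}

def find_entities(text):
    """Return (start offset, surface form, type) for each gazetteer hit, in text order."""
    found = []
    for name, label in GAZETTEER.items():
        # \b keeps the match on whole words only
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), name, label))
    return sorted(found)

# Made-up sample sentence
sample = "Jane Doe of Acme Tree Farms ships real trees from Oregon to North Carolina."
for start, name, label in find_entities(sample):
    print(f"{name} -> {label}")
```

A statistical NER model would also recognize names it has never seen by using the surrounding context, which a fixed lookup table cannot do.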



In [6]:
import pandas as pd
data = pd.read_csv('US Christmas Tree Sales.csv')
data.head(4)

Unnamed: 0,index,Year,Type of tree,Number of trees sold,Average Tree Price,Sales
0,0,2010,Real tree,27000000,36.12,975240000
1,1,2011,Real tree,30800000,34.87,1073996000
2,2,2012,Real tree,24500000,40.3,987350000
3,3,2013,Real tree,33020000,35.3,1165606000


In [7]:
text_data = data['Type of tree']

In [8]:
for text in text_data:
    print(text)

Real tree
Real tree
Real tree
Real tree
Real tree
Real tree
Real tree
Fake tree
Fake tree
Fake tree
Fake tree
Fake tree
Fake tree
Fake tree


# ***Word frequency***

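Word frequency is a text analysis technique that measures the most frequently occurring words or concepts in a given text. The cells in this section use raw counts. TF-IDF (term frequency-inverse document frequency) is a related numerical statistic that also down-weights terms that appear in most documents.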
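
To make the TF-IDF statistic concrete, here is a minimal sketch using only the standard library. It assumes the plain textbook definitions, tf = count / document length and idf = log(N / document frequency); libraries often use smoothed variants, so their numbers will differ. The `tf_idf` helper and the three-document toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, using unsmoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Toy corpus modeled on the values of the 'Type of tree' column
docs = ["Real tree", "Real tree", "Fake tree"]
for doc, score in zip(docs, tf_idf(docs)):
    print(doc, score)
```

Because "tree" occurs in every document, its idf is log(1) = 0 and its score vanishes, while the rarer "fake" scores highest. Raw word counts, by contrast, would rank "tree" first.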



In [21]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords


In [22]:
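# Tokenizer models and stopword lists used by the cells below.
# Note: newer NLTK releases may additionally require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')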


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
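# `data` is already a DataFrame, so wrapping it in a list is unnecessary;
# keep a working copy under the name `df` for the cells below
df = data.copy()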

In [26]:
def calculate_word_frequencies(text):
    tokens = word_tokenize(text.lower())  # Tokenize and convert to lowercase
    tokens = [word for word in tokens if word.isalnum()]  # Remove punctuation

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Calculate word frequency
    freq_dist = FreqDist(filtered_tokens)
    return freq_dist

In [27]:
word_freq_dict = {}
for text in df['Type of tree']:
    freq_dist = calculate_word_frequencies(text)
    for word, frequency in freq_dist.items():
        if word in word_freq_dict:
            word_freq_dict[word] += frequency
        else:
            word_freq_dict[word] = frequency

In [28]:
print("Word frequencies:")
for word, frequency in word_freq_dict.items():
    print(f"{word}: {frequency}")

Word frequencies:
real: 7
tree: 14
fake: 7


# **Collocation**

Collocation helps identify words that commonly co-occur.
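
A minimal way to see co-occurrence is to count adjacent word pairs (bigrams), as sketched below. This is a simplification: collocation tools such as NLTK's `BigramCollocationFinder` typically rank pairs with association measures rather than raw counts. The `bigram_counts` helper and the sample sentence are assumptions for illustration.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Made-up sample text, already lowercased and split on whitespace
tokens = "real tree sales rose while fake tree sales fell and real tree prices rose".split()
for pair, count in bigram_counts(tokens).most_common(3):
    print(pair, count)
```

Pairs that appear far more often than chance would predict, such as "real tree" here, are the candidates for collocations.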

# **Concordance**

Concordance helps identify the context and instances of words or a set of words.
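
A concordance can be sketched as a keyword-in-context listing: every occurrence of a word, shown with a few tokens on either side. The `concordance` helper, its `width` parameter, and the sample sentence below are assumptions for illustration; NLTK's `Text.concordance` offers a similar view on tokenized text.

```python
def concordance(tokens, keyword, width=2):
    """Return each occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Made-up sample text split on whitespace
tokens = "Real tree sales rose while fake tree sales fell".split()
for line in concordance(tokens, "tree"):
    print(line)
```

Reading the occurrences side by side shows how the same word is used in different contexts, here once next to "Real" and once next to "fake".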
