# Lesson 38: Text data analysis demonstration

This notebook demonstrates key concepts and tools for text data analysis in NLP.

**1. Text preprocessing**
- Tokenization
- Normalization and cleaning
- Stemming and lemmatization

**2. Text exploration**
- Word frequency analysis
- Word cloud visualization

**3. Text classification**
- Naive Bayes classifier

**4. Rule-based sentiment analysis**
- VADER sentiment analyzer


## Notebook setup

### Imports

In [None]:
import nltk
import pandas as pd
import matplotlib.pyplot as plt

from collections import Counter
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, movie_reviews
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from wordcloud import WordCloud

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # Tokenizer data required by newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('movie_reviews', quiet=True)
nltk.download('vader_lexicon', quiet=True)

### Dataset

We use the NLTK movie reviews corpus, which contains 2,000 movie reviews: 1,000 labeled positive (`pos`) and 1,000 labeled negative (`neg`).

In [None]:
# Load movie reviews corpus
documents = [(movie_reviews.raw(fileid), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Create dataframe
df = pd.DataFrame(documents, columns=['text', 'label'])

print(f'Dataset shape: {df.shape}')
print(f'Label distribution:\n{df["label"].value_counts()}')
df.head()

## 1. Text preprocessing

### 1.1. Tokenization

Tokenization splits text into individual units (tokens) such as words or sentences.

NLTK [word_tokenize](https://www.nltk.org/api/nltk.tokenize.html) documentation

In [None]:
# Sample text for demonstration
sample_text = df['text'].iloc[0][:500]

# Word tokenization
word_tokens = word_tokenize(sample_text)

# Sentence tokenization
sent_tokens = sent_tokenize(sample_text)

print(f'Original text:\n{sample_text}\n')
print(f'Word tokens ({len(word_tokens)} tokens):\n{word_tokens[:20]}...\n')
print(f'Sentence tokens ({len(sent_tokens)} sentences):\n{sent_tokens[:2]}')
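
To make the idea of splitting text into tokens concrete, here is a minimal regex-based sketch. It is not how `word_tokenize` or `sent_tokenize` work internally (NLTK handles abbreviations, contractions, and many other cases); the function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helpers introduced only for illustration.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split text on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

demo = "The plot is thin. Still, it's fun!"
print(simple_word_tokenize(demo))
# ['The', 'plot', 'is', 'thin', '.', 'Still', ',', 'it', "'", 's', 'fun', '!']
print(simple_sent_tokenize(demo))
# ['The plot is thin.', "Still, it's fun!"]
```

Note how the naive rule breaks `it's` into three tokens and would split a sentence after an abbreviation such as `Dr.`, which is one reason to prefer a trained tokenizer in practice.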

### 1.2. Normalization and cleaning

Text normalization here means lowercasing, dropping tokens that are not purely alphabetic (punctuation, numbers, and fragments such as `'s`), and filtering stopwords.

In [None]:
# Lowercase
tokens_lower = [token.lower() for token in word_tokens]

# Remove non-alphabetic tokens
tokens_alpha = [token for token in tokens_lower if token.isalpha()]

# Remove stopwords
stop_words = set(stopwords.words('english'))
tokens_clean = [token for token in tokens_alpha if token not in stop_words]

print(f'Original tokens: {len(word_tokens)}')
print(f'After lowercasing: {len(tokens_lower)}')
print(f'After removing non-alpha: {len(tokens_alpha)}')
print(f'After removing stopwords: {len(tokens_clean)}')
print(f'\nCleaned tokens:\n{tokens_clean[:15]}')
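
The three cleaning steps can also be sketched as one small function. This version uses only the standard library and a tiny hand-written stopword set (`TINY_STOPWORDS` is a hypothetical stand-in, far smaller than NLTK's English list), so it runs without any downloads.

```python
# Hypothetical stopword set for illustration only
TINY_STOPWORDS = {"the", "is", "a", "of", "and", "it"}

def normalize(tokens, stop_words=TINY_STOPWORDS):
    """Lowercase tokens, keep purely alphabetic ones, and drop stopwords."""
    lowered = [t.lower() for t in tokens]
    alpha = [t for t in lowered if t.isalpha()]
    return [t for t in alpha if t not in stop_words]

print(normalize(["The", "film", "is", "2", "hours", "of", "FUN", "!"]))
# ['film', 'hours', 'fun']
```

Lowercasing comes first because the stopword set is lowercase; filtering before lowercasing would let `The` slip through.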

### 1.3. Stemming and lemmatization

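Stemming strips suffixes with heuristic rules, so the result (the stem) is not always a real word. Lemmatization maps a word to its dictionary form (lemma) using a vocabulary. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed through its `pos` argument, so verbs and adjectives are often left unchanged in the comparison below.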

NLTK [PorterStemmer](https://www.nltk.org/api/nltk.stem.porter.html) and [WordNetLemmatizer](https://www.nltk.org/api/nltk.stem.wordnet.html) documentation

In [None]:
# Stemming
stemmer = PorterStemmer()
tokens_stemmed = [stemmer.stem(token) for token in tokens_clean]

# Lemmatization
lemmatizer = WordNetLemmatizer()
tokens_lemmatized = [lemmatizer.lemmatize(token) for token in tokens_clean]

# Compare results
comparison_df = pd.DataFrame({
    'original': tokens_clean[:10],
    'stemmed': tokens_stemmed[:10],
    'lemmatized': tokens_lemmatized[:10]
})

comparison_df
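
The difference between the two approaches can be illustrated with a toy sketch. The suffix list and the lemma dictionary below are hypothetical and deliberately tiny; the Porter algorithm applies many more rules, and WordNet covers a full vocabulary.

```python
# Illustrative suffix rules, checked in order (not the Porter rule set)
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lookup table standing in for a real dictionary such as WordNet
TOY_LEMMAS = {"movies": "movie", "studies": "study", "ran": "run"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; return it unchanged if absent."""
    return TOY_LEMMAS.get(word, word)

for w in ["movies", "studies", "ran", "acting"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based stemmer produces `movi` and `studi`, which are not words, and cannot handle the irregular form `ran`. The dictionary lookup returns real words but only for entries it knows about.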

## 2. Text exploration

### 2.1. Word frequency analysis

In [None]:
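# Preprocessing function applied to every review in the corpus
def preprocess_text(text):
    """Lowercase, tokenize, and keep alphabetic tokens that are not stopwords."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return tokens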

# Get all tokens from corpus
all_tokens = []

for text in df['text']:
    all_tokens.extend(preprocess_text(text))

# Count word frequencies
word_freq = Counter(all_tokens)
top_20 = word_freq.most_common(20)

# Display top words
freq_df = pd.DataFrame(top_20, columns=['word', 'count'])
freq_df

### 2.2. Word cloud visualization

In [None]:
# Create word cloud
wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white',
    colormap='Greys'
).generate_from_frequencies(word_freq)

plt.figure(figsize=(10, 5))
plt.title('Word cloud of movie reviews corpus')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## 3. Text classification

### 3.1. Naive Bayes classifier

Naive Bayes is a simple but effective text classifier. It uses word counts as features and makes the "naive" assumption that words occur independently of one another given the class.

Scikit-learn [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) documentation

In [None]:
# Prepare data
X = df['text']
y = df['label']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=315
)

# Vectorize text using bag-of-words
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f'Training set: {X_train_vec.shape}')
print(f'Test set: {X_test_vec.shape}')
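
What a bag-of-words vectorizer does can be sketched in a few lines of plain Python. This is a simplified illustration, not `CountVectorizer`'s implementation: it splits on whitespace only, returns dense lists rather than a sparse matrix, and the names `fit_vocabulary` and `transform` are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs, max_features=None):
    """Map the most frequent words in the training documents to column indices."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: idx for idx, word in enumerate(sorted(kept))}

def transform(docs, vocab):
    """Turn each document into a vector of word counts over the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.lower().split():
            if word in vocab:  # words unseen during fitting are ignored
                row[vocab[word]] += 1
        rows.append(row)
    return rows

train_docs = ["good film good cast", "bad film"]
vocab = fit_vocabulary(train_docs)
print(vocab)                                       # {'bad': 0, 'cast': 1, 'film': 2, 'good': 3}
print(transform(train_docs, vocab))                # [[0, 1, 1, 2], [1, 0, 1, 0]]
print(transform(["good bad ugly film"], vocab))    # [[1, 0, 1, 1]]
```

The vocabulary is learned from the training documents only, which mirrors calling `fit_transform` on the training set and `transform` on the test set: the unseen word `ugly` contributes nothing to the test vector.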

In [None]:
# Train Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vec, y_train)

# Predict and evaluate
y_pred = nb_classifier.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.3f}\n')
print('Classification report:')
print(classification_report(y_test, y_pred))
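
For intuition about what the classifier computes, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing, written with the standard library only. It is an illustration under simplifying assumptions (whitespace tokenization, a four-document toy corpus, hypothetical function names `train_nb` and `predict_nb`), not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # log prior
        for word in doc.split():
            if word in vocab:  # skip words never seen in training
                smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
                score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

toy_docs = ["great fun film", "boring dull film", "great acting", "dull plot"]
toy_labels = ["pos", "neg", "pos", "neg"]
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, "great film"))  # pos
print(predict_nb(toy_model, "dull plot"))   # neg
```

Smoothing matters because a word that never appeared with a class would otherwise have probability zero and rule that class out entirely. Working in log space avoids numerical underflow when many small probabilities are multiplied.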

## 4. Rule-based sentiment analysis

### 4.1. VADER sentiment analyzer

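VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analyzer. It combines a dictionary of words with pre-assigned sentiment scores with rules for negation, intensifiers, punctuation, and capitalization. Its `polarity_scores` method returns `neg`, `neu`, and `pos` proportions plus a `compound` score normalized to the range -1 to 1.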

NLTK [VADER](https://www.nltk.org/howto/sentiment.html) documentation

In [None]:
# Initialize VADER
sia = SentimentIntensityAnalyzer()

# Example sentences
example_texts = [
    'This movie was absolutely fantastic! I loved every minute.',
    'Terrible film. Complete waste of time and money.',
    'The movie was okay. Nothing special but watchable.',
    'I have mixed feelings about this one.'
]

# Analyze sentiment
print('VADER sentiment scores:\n')

for text in example_texts:

    scores = sia.polarity_scores(text)
    print(f'Text: {text}')
    print(f'Scores: {scores}\n')
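
The lexicon-based core of this approach can be sketched as follows. The valences in `TOY_LEXICON` are hypothetical values chosen for illustration, and the sketch omits VADER's rules for negation, intensifiers, punctuation, and capitalization. The normalization `x / sqrt(x^2 + 15)` is the one VADER uses to squash the summed valence into the range -1 to 1.

```python
import math

# Hypothetical word valences; the real VADER lexicon has thousands of rated entries
TOY_LEXICON = {
    "fantastic": 3.0,
    "loved": 2.5,
    "okay": 1.0,
    "terrible": -2.5,
    "waste": -2.0,
}

def toy_compound(text, lexicon=TOY_LEXICON, alpha=15):
    """Sum the valences of known words and normalize the total to (-1, 1)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = sum(lexicon.get(w, 0.0) for w in words)
    return total / math.sqrt(total * total + alpha)

for text in ["Absolutely fantastic! I loved it.",
             "Terrible film. Complete waste of time.",
             "I have mixed feelings about this one."]:
    print(f"{toy_compound(text):+.3f}  {text}")
```

A sentence with no lexicon words scores exactly 0, which is why conventional cutoffs such as 0.05 and -0.05 on the compound score are used to separate positive, neutral, and negative text.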

### 4.2. VADER evaluation on corpus

In [None]:
# Map VADER's compound score to a binary label
def vader_predict(text):
    """Return 'pos' if the compound score is at least 0.05, otherwise 'neg'."""
    scores = sia.polarity_scores(text)
    # The corpus has no neutral class, so neutral scores are grouped with 'neg'
    return 'pos' if scores['compound'] >= 0.05 else 'neg'

# Predict with VADER
y_pred_vader = [vader_predict(text) for text in X_test]
vader_accuracy = accuracy_score(y_test, y_pred_vader)

print(f'VADER accuracy: {vader_accuracy:.3f}')
print(f'Naive Bayes accuracy: {accuracy:.3f}')