<a href="https://colab.research.google.com/github/junting-huang/data_storytelling/blob/main/case_2_number.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# case_2. number

In this tutorial, we explore three widely used Python libraries: pandas, Matplotlib, and NLTK. Combined, these packages let you efficiently manipulate, visualize, and analyze data, including text data.

* Pandas: Pandas is a robust data manipulation and analysis library, offering convenient data structures like DataFrame and Series for efficient handling and analysis of structured data. Pandas facilitates data manipulation tasks such as filtering, grouping, merging, and reshaping. It also handles missing data and supports dataset cleaning.

* Matplotlib: Matplotlib stands out as a versatile 2D plotting library, enabling the creation of static, animated, and interactive visualizations in Python. Matplotlib supports diverse plot types like line plots, scatter plots, bar plots, and histograms.

* NLTK (Natural Language Toolkit): NLTK serves as a powerful library for natural language processing tasks, offering user-friendly interfaces for tasks like tokenization, stemming, tagging, parsing, and more. NLTK also provides functionality for named entity recognition, sentiment analysis, and other language-related tasks.


Exploratory Data Analysis (EDA) on text data is a vital process that involves delving into the structure, characteristics, and content of textual information to uncover meaningful insights. Starting with an overview of the corpus, EDA includes essential tasks such as tokenization, preprocessing, and statistical analysis. Text statistics, visualizations like word clouds and bar charts, and N-gram analysis offer a comprehensive understanding of word frequencies, document lengths, and patterns beyond individual words. Sentiment analysis, named entity recognition (NER), topic modeling, and document similarity measures further enhance the exploration process. The use of interactive tools and consideration of contextual analysis contribute to a holistic approach in extracting valuable information from text data. 


The objective of this lab session is to guide you through the process of conducting Exploratory Data Analysis (EDA) on textual data. The focus is on providing practical insights into effectively exploring and understanding the characteristics of text data.

## 2.1 importing libraries

In [None]:
import nltk
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## 2.2 importing corpus

In [None]:
# Reading the literary work from a text file
filename = './data/walden.txt'

with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()

In [None]:
print(text)

## 2.3 tokenizing the text

Tokenization is a fundamental step in natural language processing (NLP) that breaks a text into individual units, or tokens. We use NLTK's word_tokenize function, which first splits the text into sentences with the Punkt tokenizer and then splits each sentence into words and punctuation marks. Punkt is a pre-trained, unsupervised model for sentence boundary detection that is part of the Natural Language Toolkit (NLTK) library. It learns abbreviations, collocations, and typical sentence starters from a large corpus of text, and pre-trained models are available for several languages.
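
To get a feel for what tokens look like before running NLTK, here is a minimal sketch that uses a regular expression from the standard library as a stand-in. It is not the Punkt or word_tokenize algorithm and will differ on cases such as contractions. The names sample and sample_tokens are made up for this illustration.

```python
import re

# A single sentence from Walden, used only as sample input
sample = "I went to the woods because I wished to live deliberately."

# Rough stand-in for word-level tokenization (an assumption, not NLTK's method):
# runs of word characters become tokens, and so does each punctuation mark
sample_tokens = re.findall(r"\w+|[^\w\s]", sample.lower())

print(sample_tokens)
# ['i', 'went', 'to', 'the', 'woods', 'because', 'i', 'wished', 'to', 'live', 'deliberately', '.']
```

Note that the final period is kept as its own token, which is why punctuation has to be filtered out in a later step.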

In [None]:
# Downloading the Punkt tokenizer models and the stopword lists
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by recent NLTK releases
nltk.download('stopwords')

# Tokenizing the text
tokens = word_tokenize(text.lower())  # Convert text to lowercase and tokenize

In [None]:
print(tokens)

The .isalnum() method is a Python string method that checks whether every character in a string is alphanumeric, that is, a letter (including non-ASCII letters such as é) or a digit. If all characters are alphanumeric and the string is not empty, the method returns True; otherwise, it returns False. We use it below to drop punctuation tokens such as commas and periods.
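
The behavior described above can be checked on a few hand-picked strings. This is a small sketch; the sample strings are made up for illustration.

```python
# Sample tokens chosen to show which ones survive an isalnum() filter
samples = ["walden", "1854", "pond's", "well-known", ",", "café", ""]

# Map each sample to the result of isalnum()
results = {s: s.isalnum() for s in samples}

for token, is_alnum in results.items():
    print(repr(token), is_alnum)
```

Tokens that contain an apostrophe or a hyphen are rejected along with pure punctuation, so this filter also removes words such as "well-known" if the tokenizer keeps them as a single token.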

In [None]:
# Removing stopwords and punctuation
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Counting word frequencies
word_frequencies = nltk.FreqDist(filtered_tokens)

In [None]:
type(word_frequencies.items())
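
nltk.FreqDist is a subclass of collections.Counter, so the counting step can be sketched on a toy token list with the standard library alone. The names toy_tokens and toy_frequencies are made up for this illustration.

```python
from collections import Counter

# A tiny, made-up token list standing in for filtered_tokens
toy_tokens = ["pond", "woods", "pond", "life", "woods", "pond"]

# Count how often each token occurs
toy_frequencies = Counter(toy_tokens)

# items() yields (word, count) pairs, which is what a DataFrame can be built from
print(list(toy_frequencies.items()))   # [('pond', 3), ('woods', 2), ('life', 1)]

# most_common(k) returns the k most frequent words, already sorted
print(toy_frequencies.most_common(2))  # [('pond', 3), ('woods', 2)]
```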

## 2.4 counting words

We create a DataFrame named df from the (word, frequency) pairs in word_frequencies, with columns 'Word' and 'Frequency'. After that, we sort the DataFrame by the 'Frequency' column in descending order.

In [None]:
# Creating a DataFrame
df = pd.DataFrame(word_frequencies.items(), columns=['Word', 'Frequency'])

# Sorting the DataFrame by Frequency
df = df.sort_values(by='Frequency', ascending=False)

In [None]:
# Showing the top 5 frequent words
df.head()

## 2.5 plotting the graph

In a data analysis task, we often use visualizations to gain insight into the data or text at hand. For example, plotting the top-k most frequent words, often as a bar chart or a word cloud, serves several purposes in data analysis and text visualization:

* Identify Key Themes: Analyzing the most frequent words helps identify key themes, topics, or subjects present in the text. This is particularly useful for understanding the primary focus or content of a document or corpus.

* Content Summary: The top words provide a concise summary of the content. This is beneficial when dealing with large amounts of text, allowing users to quickly grasp the main ideas without reading the entire text.

* Insights into Language Usage: Analyzing frequent words offers insights into language usage and style. It helps understand the vocabulary and common phrases used, providing context about the writing style.

We use the Matplotlib package to create a bar plot of the top 20 most frequent words from our DataFrame df. The code below sets the figure size, uses the head(20) method to select the 20 most frequent words (df is already sorted), and then creates a bar plot with the 'Word' column on the x-axis and the 'Frequency' column on the y-axis. For more information on plotting with Matplotlib, see: https://matplotlib.org/stable/users/explain/quick_start.html#quick-start.
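
The same kind of chart can also be sketched with Matplotlib's object-oriented interface, where the figure and axes are created explicitly. This is a minimal sketch on made-up words and counts, not the Walden data. The names toy_words and toy_counts are assumptions for illustration.

```python
import matplotlib.pyplot as plt

# Made-up words and counts, already sorted from most to least frequent
toy_words = ["pond", "woods", "life", "house"]
toy_counts = [12, 9, 7, 4]

# Create the figure and axes in one call, with an explicit size
fig, ax = plt.subplots(figsize=(8, 4))

# One bar per word
ax.bar(toy_words, toy_counts, color="teal")
ax.set_title("Top 4 Most Frequent Words (toy data)")
ax.set_xlabel("Words")
ax.set_ylabel("Frequency")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

In a notebook the figure is usually rendered at the end of the cell; outside a notebook, call plt.show() to display it.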

In [None]:
# Plotting the top 20 frequent words
# (figsize is passed to plot() so that no empty extra figure is created)
df.head(20).plot(x='Word', y='Frequency', kind='bar', legend=False, color='teal', figsize=(20, 6))
plt.title('Top 20 Most Frequent Words')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

Is this word frequency bar plot useful for understanding the key themes or topics in Walden?

## 2.6 term frequency - inverse document frequency (TF-IDF)

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects the importance of a term in a document relative to a collection of documents (corpus). It is commonly used in natural language processing and information retrieval to extract meaningful information from a large set of documents. TF-IDF consists of two main components:

* Term Frequency (TF): Measures how often a term appears in a document. It is calculated as the ratio of the number of times a term occurs in a document to the total number of terms in that document. The idea is to give higher weights to terms that appear frequently in a document, as they are likely to be more significant.

* Inverse Document Frequency (IDF): Measures how important a term is across the entire corpus. It is calculated as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term. The goal is to assign higher weights to terms that are rare across the entire corpus, making them more distinctive.

Multiplying these two values together gives the final TF-IDF value.
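
The two definitions can be worked through by hand on a tiny corpus. The sketch below follows the textbook formulas as stated here, using the natural logarithm (an assumption, since the base is not specified) and assuming each term occurs in at least one document. The helper names tf, idf, and tf_idf and the toy documents are made up for illustration. TfidfVectorizer's default settings add IDF smoothing and L2 normalization, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three made-up "documents", each already tokenized
toy_docs = [
    ["pond", "woods", "pond"],
    ["woods", "cabin"],
    ["woods", "beans", "beans", "beans"],
]

def tf(term, doc):
    """Term frequency: occurrences of term divided by the document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency: log of (number of docs / docs containing term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """TF-IDF: the product of the two components."""
    return tf(term, doc) * idf(term, docs)

# "woods" occurs in every document, so its IDF (and TF-IDF) is zero
print(tf_idf("woods", toy_docs[0], toy_docs))

# "pond" occurs only in the first document, so it scores higher there
print(tf_idf("pond", toy_docs[0], toy_docs))
```

The example shows why a word that appears everywhere contributes nothing under this definition, while a word concentrated in one document stands out.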

Now we use the TfidfVectorizer class from the scikit-learn package to convert the text of Walden into a TF-IDF matrix. Because we pass a list of sentences, each sentence is treated as a separate document. The resulting tfidf_matrix is a sparse matrix in which each row corresponds to a sentence and each column corresponds to a unique term in the vocabulary. The values in the matrix are the TF-IDF scores of each term in each sentence.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = nltk.sent_tokenize(text)
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)

We create a new DataFrame named df, indexed by word, with a single 'Importance' column that holds each word's TF-IDF score summed over all sentences. Note that this overwrites the word-frequency DataFrame from section 2.4. We then sort the DataFrame by 'Importance' in descending order and, as before, use Matplotlib to create a bar plot of the top 20 most important words.

In [None]:
# Getting feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Summing the TF-IDF scores for each word across all sentences (axis=0 sums each column)
word_importance = tfidf_matrix.sum(axis=0)

# Creating a DataFrame to store words and their importance
df = pd.DataFrame(word_importance.T, index=feature_names, columns=["Importance"])

# Sorting the DataFrame by Importance
df = df.sort_values(by='Importance', ascending=False)

In [None]:
df.head()

In [None]:
# Plotting the top 20 most important words
# (figsize is passed to plot() so that no empty extra figure is created)
df.head(20).plot(kind='bar', legend=False, color='teal', figsize=(10, 6))
plt.title('Top 20 Most Important Words')
plt.xlabel('Words')
plt.ylabel('Importance')
plt.xticks(rotation=45)
plt.show()

Is this TF-IDF bar plot more informative than the previous word frequency plot?

## 2.7 word cloud

A word cloud is another data visualization technique used to represent the most frequently occurring words in a given text or corpus. It provides a visually striking and intuitive way to see the prominent words in a body of text. The size of each word in the cloud reflects its frequency or importance in the text.

In Python, you can use the *WordCloud* class from the wordcloud package to create a word cloud. *WordCloud* is widely used in data analysis, text mining, and exploratory data analysis to quickly grasp the most prominent terms in a given textual dataset.
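
The idea that word size follows frequency can be sketched with a simple linear scaling. This is a simplification for illustration only, not the algorithm the wordcloud package uses. The helper font_sizes and the toy frequencies are made up.

```python
def font_sizes(frequencies, max_size=100):
    """Scale each word's font size in proportion to its frequency (hypothetical helper)."""
    top = max(frequencies.values())
    return {word: round(max_size * count / top) for word, count in frequencies.items()}

# Made-up frequencies: the most frequent word gets the largest size
toy_frequencies = {"pond": 40, "woods": 20, "cabin": 10}
sizes = font_sizes(toy_frequencies)

print(sizes)  # {'pond': 100, 'woods': 50, 'cabin': 25}
```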

In [None]:
from wordcloud import WordCloud

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the generated word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
