# case_3. data

In the last lab, we introduced exploratory data analysis (EDA) on textual data using popular Python packages such as pandas, matplotlib, and NLTK. However, real-world textual data often comes in messy and unstructured formats, presenting challenges for analysis and interpretation. This lab serves as an introductory guide, demonstrating how to preprocess raw textual data obtained from the internet, scraped from websites, or sourced from other data repositories.

Text preprocessing is an essential step in natural language processing (NLP) that involves cleaning and transforming unstructured text data to prepare it for analysis. Common steps include tokenization, stemming, lemmatization, stop-word and punctuation removal, and part-of-speech tagging. The goal is to enhance the quality of the textual data, reduce noise, and standardize its structure. In this lab, we will introduce the basics of text preprocessing and provide Python code examples that implement these tasks using NLTK, spaCy, and Gensim. By the end of the lab, readers will better understand how to prepare text data for many NLP tasks.

## 3.1 Import packages and dataset

As always, we first import the required packages and load the raw dataset. In Python's scikit-learn library, the *sklearn.datasets* module provides a collection of functions and utilities for loading and fetching standard datasets for machine learning. This module is particularly useful for quickly experimenting with algorithms and performing initial data exploration. It includes functions to download and load various well-known datasets, such as the iris dataset, the digits dataset, and the breast cancer dataset. These datasets cover a range of problem types, including classification, regression, and clustering. The *sklearn.datasets* module makes it convenient to test and prototype machine learning models without the need for external data sources, which benefits both beginners learning machine learning concepts and experienced practitioners who want to quickly assess the performance of algorithms on standard datasets.


The 20newsgroups dataset is commonly used in text classification and information retrieval. It consists of approximately 20,000 newsgroup documents covering 20 different newsgroups, or topics. These topics span a diverse range of subjects, including politics, sports, technology, and more. The 20newsgroups dataset serves as a benchmark for text classification algorithms, allowing researchers and practitioners to evaluate how well models distinguish between the news categories. Each document in the dataset is labeled with its corresponding newsgroup, making it suitable for supervised learning tasks. The dataset is often used to test and develop algorithms for document classification, topic modeling, and text clustering. In this lab, we use the 20newsgroups dataset to demonstrate common text preprocessing tasks. Please feel free to replace this corpus with any text of your interest.

In [1]:
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups

from nltk.tokenize import word_tokenize

import spacy

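# Load the training subset (the default) and strip headers, footers, and quoted replies
documents, _ = fetch_20newsgroups(
    shuffle=True,
    random_state=1,
    remove=("headers", "footers", "quotes"),
    return_X_y=True,  # return (texts, labels) instead of a Bunch object
)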


Let's take a look at the first document in this corpus:

In [2]:
text = documents[0]
print(text)

Well i'm not sure about the story nad it did seem biased. What
I disagree with is your statement that the U.S. Media is out to
ruin Israels reputation. That is rediculous. The U.S. media is
the most pro-israeli media in the world. Having lived in Europe
I realize that incidences such as the one described in the
letter have occured. The U.S. media as a whole seem to try to
ignore them. The U.S. is subsidizing Israels existance and the
Europeans are not (at least not to the same degree). So I think
that might be a reason they report more clearly on the
atrocities.
	What is a shame is that in Austria, daily reports of
the inhuman acts commited by Israeli soldiers and the blessing
received from the Government makes some of the Holocaust guilt
go away. After all, look how the Jews are treating other races
when they got power. It is unfortunate.



## 3.2 Stopwords Removal

In NLP, stopwords refer to common words that are often removed from text data during preprocessing because they are considered to carry little meaningful information. These words typically include common grammatical terms, such as articles (e.g., "the," "a," "an"), prepositions (e.g., "in," "on," "under"), conjunctions (e.g., "and," "but," "or"), and other frequently occurring words. The purpose of removing stopwords is to focus on the more significant and distinctive words in a document, improving the efficiency of text analysis and reducing noise in the data.
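To make the idea concrete before turning to NLTK's full list, here is a minimal sketch of stopword filtering. The tiny `TOY_STOPWORDS` set and the `remove_stopwords` helper are illustrative assumptions, not part of NLTK.

```python
# A tiny hand-picked stopword set, for illustration only (not NLTK's list).
TOY_STOPWORDS = {"the", "is", "in", "a", "an", "and", "that"}

def remove_stopwords(tokens, stop_words=TOY_STOPWORDS):
    """Keep tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "The media is the most watched in the world".split()
print(remove_stopwords(tokens))
# ['media', 'most', 'watched', 'world']
```

Comparing on the lowercased token means that "The" and "the" are treated alike, while the surviving tokens keep their original casing.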

In [3]:
print(list(stopwords.words("english")))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [5]:
# Tokenize the text
words = word_tokenize(text)

# Build the stopword set once, then filter (case-insensitive comparison)
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words]

Print the preprocessed text:

In [6]:
print(' '.join(filtered_words))

Well 'm sure story nad seem biased . disagree statement U.S. Media ruin Israels reputation . rediculous . U.S. media pro-israeli media world . lived Europe realize incidences one described letter occured . U.S. media whole seem try ignore . U.S. subsidizing Israels existance Europeans ( least degree ) . think might reason report clearly atrocities . shame Austria , daily reports inhuman acts commited Israeli soldiers blessing received Government makes Holocaust guilt go away . , look Jews treating races got power . unfortunate .


## 3.3 Punctuation Removal

Punctuation removal is a preprocessing step that involves eliminating punctuation marks from text data. Punctuation, such as periods, commas, exclamation points, and question marks, often doesn't contribute directly to the meaning of words and can introduce noise during text analysis. This process can be implemented using string manipulation techniques or regular expressions, where each character is examined and punctuation marks are selectively excluded. The resulting text without punctuation is cleaner, making it more suitable for subsequent language processing tasks that rely on the semantic content of the text rather than its grammatical structure.
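The two character-level approaches mentioned above can be sketched as follows. The sample sentence is made up for illustration, and both variants use only the standard library.

```python
import re
import string

sample = "Well, that is ridiculous... isn't it?"

# Option 1: delete every ASCII punctuation character with str.translate.
table = str.maketrans("", "", string.punctuation)
no_punct_translate = sample.translate(table)

# Option 2: a regular expression that drops anything that is
# neither a word character nor whitespace.
no_punct_regex = re.sub(r"[^\w\s]", "", sample)

print(no_punct_translate)  # Well that is ridiculous isnt it
print(no_punct_regex)      # Well that is ridiculous isnt it
```

Both variants work on individual characters, so "isn't" becomes "isnt" and an abbreviation such as "U.S." would become "US". Filtering whole tokens instead leaves such tokens untouched, which may or may not be what you want.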

In [7]:
import string

# Drop tokens that are punctuation marks (tokens such as "U.S." are kept)
text_no_punct = [token for token in filtered_words if token not in string.punctuation]


Print the preprocessed text:

In [8]:
print(' '.join(text_no_punct))

Well 'm sure story nad seem biased disagree statement U.S. Media ruin Israels reputation rediculous U.S. media pro-israeli media world lived Europe realize incidences one described letter occured U.S. media whole seem try ignore U.S. subsidizing Israels existance Europeans least degree think might reason report clearly atrocities shame Austria daily reports inhuman acts commited Israeli soldiers blessing received Government makes Holocaust guilt go away look Jews treating races got power unfortunate


## 3.4 Lemmatization

Lemmatization is another fundamental technique that involves reducing words to their base or dictionary forms, known as lemmas. Unlike stemming, which simply strips affixes to approximate the root form, lemmatization ensures that the transformed words are valid in the language. The process considers the context and grammatical structure of words, producing linguistically accurate lemmas. For instance, both "running" and "ran" are lemmatized to "run." This normalization step is crucial in NLP tasks where maintaining the semantic integrity of words is essential, such as information retrieval, text mining, and sentiment analysis. Lemmatization unifies variations of a word, which reduces dimensionality and helps downstream algorithms focus on the meaning carried by words in their canonical forms.
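The contrast with stemming can be illustrated with a deliberately simplified sketch. Both helpers below are toy assumptions written for this example: `crude_stem` is not a real stemming algorithm such as Porter's, and `TOY_LEMMAS` is a hypothetical three-entry lookup table, whereas real lemmatizers rely on a full vocabulary and part-of-speech information.

```python
def crude_stem(word):
    """Toy stemmer: strip the first matching suffix, with no dictionary check."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini lemma table, for illustration only.
TOY_LEMMAS = {"running": "run", "ran": "run", "studies": "study"}

def toy_lemmatize(word):
    """Toy lemmatizer: look the word up, fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["running", "ran", "studies"]:
    print(w, crude_stem(w), toy_lemmatize(w))
# running runn run
# ran ran run
# studies stud study
```

The suffix stripper produces non-words ("runn", "stud") and cannot relate the irregular form "ran" to "run", while the lookup maps every form to a valid lemma.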

In [9]:
def lemmatization(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    # Load the small English model; the parser and NER are not needed for lemmas
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        # Keep only the lemmas of tokens whose part of speech is allowed
        new_text = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(new_text))
    return texts_out

In [10]:
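# Each token is passed as its own document, so tokens whose
# part of speech is filtered out come back as empty strings
lemmatized_text = lemmatization(text_no_punct)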
print(lemmatized_text)

['', "' m", 'sure', 'story', '', 'seem', 'bias', 'disagree', 'statement', '', 'medium', '', 'israel', 'reputation', 'rediculous', '', 'medium', 'pro - israeli', 'medium', 'world', 'live', '', 'realize', 'incidence', '', 'describe', 'letter', 'occur', '', 'medium', 'whole', 'seem', 'try', '', '', 'subsidize', 'israel', 'existance', 'european', 'least', 'degree', 'think', '', 'reason', 'report', 'clearly', 'atrocity', 'shame', '', 'daily', 'report', 'inhuman', 'act', 'commit', 'israeli', 'soldier', 'blessing', 'receive', 'government', 'make', '', 'guilt', 'go', 'away', 'look', '', 'treat', 'race', 'get', 'power', 'unfortunate']


## 3.5 gensim.utils.simple_preprocess function

The *gensim.utils.simple_preprocess* function is part of the Gensim library, specifically in the gensim.utils module. It is a utility function designed for simple text preprocessing. The main purpose of this function is to tokenize and preprocess a text by performing the following operations:

* Tokenization: It breaks down the input text into individual words or tokens.

* Lowercasing: It converts all tokens to lowercase. This helps in standardizing the text and avoids treating the same word in different cases as different entities.

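* Removing Accent Marks (deacc=True): When deacc=True is passed, the function removes accent marks from characters. The default is deacc=False, so this step must be requested explicitly. It is useful for normalizing text in scenarios where accent marks are not relevant to the analysis.

* Length filtering: Tokens shorter than min_len (default 2) or longer than max_len (default 15) characters are discarded.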
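The operations listed above can be approximated with the standard library alone. This is a rough sketch for intuition, not Gensim's actual implementation: `rough_preprocess`, `strip_accents`, and the `deaccent` flag are names made up for this example, and the tokenization regex is an assumption that only approximates Gensim's behavior.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove combining accent marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def rough_preprocess(text, deaccent=False, min_len=2, max_len=15):
    """Lowercase, optionally de-accent, tokenize on letter runs, filter by length."""
    text = text.lower()
    if deaccent:
        text = strip_accents(text)
    # Runs of letters only: digits, underscores, and punctuation split tokens.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

print(rough_preprocess("Café déjà vu: pro-Israeli U.S. media!", deaccent=True))
# ['cafe', 'deja', 'vu', 'pro', 'israeli', 'media']
```

Note how the hyphenated "pro-Israeli" is split into two tokens and the single-letter pieces of "U.S." are dropped by the length filter.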

In [13]:
import gensim

# Note that the input must be a single string, not a list of tokens
processed_text = gensim.utils.simple_preprocess(" ".join(lemmatized_text), deacc=True)

In [14]:
print(processed_text)


['sure', 'story', 'seem', 'bias', 'disagree', 'statement', 'medium', 'israel', 'reputation', 'rediculous', 'medium', 'pro', 'israeli', 'medium', 'world', 'live', 'realize', 'incidence', 'describe', 'letter', 'occur', 'medium', 'whole', 'seem', 'try', 'subsidize', 'israel', 'existance', 'european', 'least', 'degree', 'think', 'reason', 'report', 'clearly', 'atrocity', 'shame', 'daily', 'report', 'inhuman', 'act', 'commit', 'israeli', 'soldier', 'blessing', 'receive', 'government', 'make', 'guilt', 'go', 'away', 'look', 'treat', 'race', 'get', 'power', 'unfortunate']


## 3.6 Topic Modeling


Imagine you have a massive collection of documents, like articles, blog posts, or research papers. The challenge is to understand what these documents are about without reading each one individually. This is where topic modeling comes in. In the context of text, a topic is a set of words that tend to occur together. For example, if you're reading about cars, words like "engine," "speed," and "fuel efficiency" might frequently appear together, indicating the topic is related to automobiles. Topic modeling is a technique in the field of natural language processing (NLP) that helps us automatically discover hidden themes or topics in a large collection of text documents. Instead of reading every document, the idea is to let the computer analyze the words and find patterns, grouping them into topics.

Topic models find useful applications in literature study by providing sophisticated tools to analyze, categorize, and derive insights from vast collections of texts. One significant application is in literature reviews, where researchers can employ topic modeling to efficiently identify and synthesize key themes across numerous academic papers, streamlining the process of understanding existing work on a particular subject. Moreover, topic modeling facilitates genre and style analysis, helping scholars discern distinguishing features of different literary genres or track the evolving styles of specific authors. It aids in character and plot analysis, unraveling the thematic intricacies within novels or plays. Comparative literature studies benefit from topic modeling by identifying common themes across different cultures and languages, fostering cross-cultural literary analysis. Additionally, the technique is instrumental in recognizing literary movements, mapping the evolution of trends within the field.

The most common method for topic modeling is Latent Dirichlet Allocation (LDA). In the following section, we are going to show you how to run a simple LDA on a given corpus.
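LDA assumes a simple generative story: each document draws its own mixture of topics from a Dirichlet distribution, and each word is produced by first picking a topic from that mixture and then picking a word from the chosen topic. The sketch below simulates this story with the standard library. The two toy topics, their word probabilities, and the helper names are all hypothetical, and the code illustrates only the modeling assumption, not the inference that Gensim performs when fitting a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one toy document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=theta)[0]        # pick a topic
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])  # pick a word from it
    return theta, words

# Hypothetical topics: each maps words to probabilities that sum to 1.
toy_topics = [
    {"engine": 0.4, "speed": 0.3, "fuel": 0.3},
    {"law": 0.5, "government": 0.3, "judge": 0.2},
]
rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(p, 2) for p in theta])
print(doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word distributions and the per-document topic mixtures that most plausibly generated them.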

In [16]:
import gensim.corpora as corpora
import spacy

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [17]:
lemmatized_texts = lemmatization(documents)

def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True)
        final.append(new)
    return (final)

data_words = gen_words(lemmatized_texts)


In [18]:
id2word = corpora.Dictionary(data_words)

corpus = []
for text in data_words:
    new = id2word.doc2bow(text)
    corpus.append(new)

print(corpus[0][0:20])

# Look up the word behind the first token id of the first document
word = id2word[corpus[0][0][0]]
print(word)
print(len(corpus))

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)]
act
11314
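Each pair in the bag-of-words output above is a (token id, count) tuple. The following standard-library sketch reproduces the idea on a toy corpus. The helpers `build_vocab` and `to_bow` are illustrative stand-ins, and Gensim may assign ids in a different order than this first-appearance scheme.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of one document."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

toy_docs = [["media", "report", "media"], ["report", "government"]]
vocab = build_vocab(toy_docs)
print(vocab)
# {'media': 0, 'report': 1, 'government': 2}
print([to_bow(doc, vocab) for doc in toy_docs])
# [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation: only which tokens occur, and how often, is passed on to the topic model.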


In [19]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=8,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha="auto")

In [20]:
lda_model.print_topics(num_topics=8, num_words=10)

[(0,
  '0.022*"say" + 0.021*"people" + 0.011*"other" + 0.010*"believe" + 0.008*"human" + 0.008*"word" + 0.007*"evidence" + 0.007*"true" + 0.006*"world" + 0.006*"life"'),
 (1,
  '0.704*"ax" + 0.049*"max" + 0.008*"di" + 0.007*"pl" + 0.004*"tm" + 0.004*"wm" + 0.003*"ei" + 0.003*"bxn" + 0.003*"okz" + 0.003*"tq"'),
 (2,
  '0.033*"key" + 0.022*"use" + 0.021*"government" + 0.019*"law" + 0.012*"system" + 0.012*"public" + 0.011*"action" + 0.010*"judge" + 0.008*"secret" + 0.008*"drug"'),
 (3,
  '0.010*"issue" + 0.009*"group" + 0.008*"value" + 0.008*"report" + 0.007*"state" + 0.007*"provide" + 0.007*"rule" + 0.006*"new" + 0.006*"space" + 0.005*"israeli"'),
 (4,
  '0.024*"drive" + 0.019*"use" + 0.017*"car" + 0.014*"buy" + 0.014*"system" + 0.013*"card" + 0.012*"price" + 0.012*"disk" + 0.012*"new" + 0.011*"sell"'),
 (5,
  '0.015*"get" + 0.013*"know" + 0.012*"just" + 0.011*"make" + 0.010*"go" + 0.010*"so" + 0.010*"time" + 0.010*"think" + 0.010*"more" + 0.010*"good"'),
 (6,
  '0.033*"use" + 0.023*"fil

Does LDA provide coherent topics? What insights do the results provide regarding the corpus?