# Text Data Mining

Text data mining is the process of extracting essential information from natural language text. The data we generate through text messages, documents, emails, and files is written in natural language. Text mining is primarily used to draw useful insights or patterns from such data.

<img src="https://i.imgur.com/mQvsNyf.png" width="400" height="350" class="center">


<br>

# Areas of text mining in data mining

Text mining draws on the following areas:

<br><br>

<img src="https://i.imgur.com/Y0Ql6J3.png" width="450" height="450" class="center">


<br><br>

- **Information Extraction:**<br>
Information extraction is the automatic extraction of structured data, such as entities, relationships between entities, and attributes describing entities, from an unstructured source.

- **Natural Language Processing:**<br>
NLP stands for natural language processing. It enables computer software to understand human language as it is spoken and written. NLP is primarily a component of artificial intelligence (AI). Developing NLP applications is difficult because computers generally expect humans to "speak" to them in a programming language that is precise, unambiguous, and highly structured. Human speech, however, is often imprecise and depends on many complex variables, including slang, social context, and regional dialects.

- **Data Mining:**<br>
Data mining refers to the extraction of useful data and hidden patterns from large data sets. Data mining tools can predict behaviors and future trends, which allows businesses to make better data-driven decisions. They can also be used to resolve many business problems that have traditionally been too time-consuming.

- **Information Retrieval:**<br>
Information retrieval deals with retrieving useful data from the data stored in our systems. As an analogy, the search engines built into websites, such as e-commerce sites, are a form of information retrieval.
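
As a rough illustration of information retrieval, the sketch below builds a tiny inverted index and answers a keyword query against it. The sample documents and the helper names `build_index` and `search` are made up for this example; real search engines add ranking, normalization, and much more.

```python
from collections import defaultdict

# Hypothetical mini-collection of product descriptions (illustrative data only)
documents = {
    0: "red running shoes for men",
    1: "blue running jacket",
    2: "red leather wallet",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(documents)
print(search(index, "red"))          # documents 0 and 2
print(search(index, "red running"))  # document 0 only
```

The index is built once and each query is answered with set intersections, so the documents are not rescanned for every search.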


# Text Mining Process:

The text mining process incorporates the following steps to extract the data from the document.<br><br>

<img src="https://i.imgur.com/sDfE2o4.png" width="500" height="400">

- **Text transformation**<br>
Text transformation converts the raw text of a document into a representation that later steps can process, for example by normalizing capitalization.
The two major ways of representing a document are:

    - Bag of words
    - Vector Space
   
- **Text Pre-processing**<br>
Pre-processing is a critical step in text mining, natural language processing (NLP), and information retrieval (IR). In text mining, pre-processing prepares unstructured text so that useful information and knowledge can be extracted from it. In information retrieval, it supports the task of choosing which documents in a collection should be retrieved to fulfill the user's need.

- **Feature selection**<br>
Feature selection is a significant part of data mining. Feature selection can be defined as the process of reducing the input of processing or finding the essential information sources. The feature selection is also called variable selection.

- **Data Mining**<br>
In this step, the text mining procedure merges with the conventional data mining process. Classic data mining techniques are applied to the structured data produced by the earlier steps.

- **Evaluate**<br>
Finally, the results are evaluated. Once they have been evaluated, the results can be discarded.


- **Applications**<br>
Text mining is applied in areas such as:

    - Risk Management
    - Customer Care Service
    - Business Intelligence
    - Social Media Analysis

# Natural Language Processing (NLP)

NLP stands for Natural Language Processing, a field at the intersection of computer science, linguistics, and artificial intelligence. It is the technology machines use to understand, analyse, manipulate, and interpret human languages. It helps developers organize knowledge for tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.

## Applications of NLP

Common applications of NLP include:

**1. Speech Recognition**<br>
Speech recognition converts spoken words into text. It is used in applications such as mobile devices, home automation, video recovery, dictation to Microsoft Word, voice biometrics, and voice user interfaces.<br>

<img src="https://i.imgur.com/EatZjYU.png" width="300" height="100" class="center">


**2. Spam Detection**<br>
Spam detection identifies unwanted e-mails before they reach a user's inbox.<br>

<img src="https://i.imgur.com/U31OVdW.png" width="400" class="center">

**3. Sentiment Analysis**<br>
Sentiment analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. It is implemented through a combination of NLP and statistics, by assigning values to the text (positive, negative, or neutral) and identifying the mood of the context (happy, sad, angry, etc.).

<img src="https://i.imgur.com/AYd74N2.png" width="400" height="500" class="center">

**4. Spelling correction**<br>
Software such as Microsoft Word and PowerPoint uses NLP to provide spelling correction.

<img src="https://i.imgur.com/0Ir3LFQ.png" width="400" height="400" class="center">

**5. Chatbot**<br>
Chatbots are one of the important applications of NLP. Many companies use them to provide customer chat services.

<img src="https://i.imgur.com/ixoPd2Z.png" width="300" height="400" class="center">



## Terminologies in NLP
These terms are commonly used in the context of natural language processing (NLP) and text analysis:

1. **Corpus:**
   - A corpus refers to a large and structured collection of text documents. It can include various types of written or spoken texts, such as books, articles, transcripts, web pages, and more. Corpora (plural of corpus) are used for linguistic analysis, language modeling, and training NLP models.

2. **Document:**
   - In NLP, a document typically refers to a single unit of text, which can be as short as a sentence or as long as an entire book or article. Documents are the individual pieces of text that are analyzed or processed within a corpus.

3. **Vocabulary:**
   - Vocabulary, in the context of NLP, refers to the set of all unique words or tokens that appear in a corpus or a specific set of documents. It represents the entire lexicon of the language or text data being analyzed. Building a vocabulary is a common preprocessing step in NLP, and it's used to create word embeddings, feature vectors, or perform various text analysis tasks.

4. **Words:**
   - Words are the basic units of language, and in NLP, they refer to individual linguistic units that make up sentences and documents. Words can vary in length and can include single words (e.g., "cat"), multi-word phrases (e.g., "New York City"), or even symbols and punctuation. Analyzing words is a fundamental aspect of NLP, and it involves tasks such as tokenization (splitting text into words), stemming (reducing words to their base or root form), and more.

These terms are central to understanding and working with text data in natural language processing, and they form the foundation for various NLP techniques and applications, including text classification, sentiment analysis, machine translation, and more.
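
To make these terms concrete, here is a minimal sketch in plain Python. The two sentences are invented sample data, and the whitespace-based tokenization is a simplifying assumption rather than what a production tokenizer does.

```python
# A tiny corpus: each string is one document (illustrative data only)
corpus_docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

# Words (tokens): lowercase each document, drop periods, split on whitespace
tokenized_docs = [doc.lower().replace(".", "").split() for doc in corpus_docs]

# Vocabulary: the unique tokens across the whole corpus
vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))

print(len(corpus_docs))   # number of documents: 2
print(tokenized_docs[0])  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(vocabulary)         # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
```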

## How to build an NLP pipeline

An NLP pipeline is built from the following steps:

### Step 1: Sentence Segmentation

Sentence segmentation is the first step in building the NLP pipeline. It breaks a paragraph into separate sentences.

In [None]:
# pip install nltk --upgrade --quiet
import warnings
warnings.filterwarnings('ignore')

In [None]:
import nltk
nltk.download('punkt_tab')

#### nltk.download('punkt_tab')
`nltk.download('punkt_tab')` downloads the "punkt_tab" resource for the Natural Language Toolkit (NLTK), a popular Python library for natural language processing (NLP) tasks. The resource contains the pre-trained Punkt tokenizer data that NLTK's `sent_tokenize` and `word_tokenize` functions rely on; recent NLTK releases use it in place of the older "punkt" package. Once it is downloaded, you can use NLTK's tokenizers in your Python scripts or applications. Tokenization is an essential preprocessing step in many NLP tasks, as it breaks text down into smaller units for further analysis or processing.

In [None]:
corpus = """I don't drink coffee.
What a wonderful day!
Timon and Simba went for a party.
Shalini got internship!"""
print(corpus)

In [None]:
from nltk.tokenize import sent_tokenize
sent_tokenize(corpus)

### Step 2: Word Tokenization

A word tokenizer breaks a sentence into separate words or tokens.

In [None]:
from nltk.tokenize import word_tokenize
words = word_tokenize(corpus)
print(words)

In [None]:
print(corpus.split())

In [None]:
from nltk.tokenize import TreebankWordTokenizer
t = TreebankWordTokenizer()
print(t.tokenize(corpus))

### Step 3: Stop Words Removal

Stop words are very common words, such as "the", "is", and "and", that carry little meaning on their own and are usually removed before further analysis.

In [None]:
# nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english'))

In [None]:
'very' in list(stopwords.words('english'))

In [None]:
len(stopwords.words('english'))
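
The list above can be used to filter tokens. The sketch below shows the idea with a small hand-picked stop word set, so it runs without any NLTK download; `demo_stop_words` and `remove_stop_words` are names invented for this example.

```python
# Small hand-picked stop word set for illustration (an assumption, not NLTK's full list)
demo_stop_words = {"a", "and", "for", "the", "what"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["Timon", "and", "Simba", "went", "for", "a", "party"]
print(remove_stop_words(tokens, demo_stop_words))
# ['Timon', 'Simba', 'went', 'party']
```

With NLTK, the same function can be called with `set(stopwords.words('english'))` as the second argument.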

### Step 4: Stemming

Stemming normalizes words to their base or root form. For example, celebrates, celebrated, and celebrating all derive from the single root word "celebrate." The main drawback of stemming is that it sometimes produces a root that is not a real word.

**For example,** intelligence, intelligent, and intelligently all reduce to the single root "intelligen." In English, "intelligen" has no meaning.
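
To see why stems can end up as non-words, consider this toy suffix stripper. It is a deliberately naive sketch, not the Porter algorithm; the function name `naive_stem` and its suffix list are assumptions made for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["celebrates", "celebrated", "celebrating", "history"]:
    print(w, "----->", naive_stem(w))
# celebrates -----> celebrat
# celebrated -----> celebrat
# celebrating -----> celebrat
# history -----> history
```

All three inflected forms collapse to the same stem, which is useful for matching, but "celebrat" is not an English word.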

In [None]:
import nltk
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

In [None]:
words = ['eating','eats','ate','writing','programming','programs',
        'history','finally','finalized','hyperparameter','sleeping']

In [None]:
for word in words:
    print(word + '----->'+stemming.stem(word))

In [None]:
stemming.stem('congratulations')

In [None]:
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()

In [None]:
for word in words:
    print(word + '----->'+lancaster.stem(word))

In [None]:
from nltk.stem import RegexpStemmer
reg = RegexpStemmer('ing$|un|s$|able$',min=4)

In [None]:
reg.stem('eating')

In [None]:
reg.stem('cars')

In [None]:
reg.stem('unhappy')

In [None]:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer('english',ignore_stopwords=False)

In [None]:
for word in words:
    print(word + '----->'+snowball.stem(word))

### Step 5: Lemmatization

Lemmatization is similar to stemming. It groups the different inflected forms of a word under a single base form, called the lemma. The main difference between the two is that lemmatization produces a root word that is itself a valid, meaningful word.

**For example:** In lemmatization, the words intelligence, intelligent, and intelligently have the root word "intelligent", which has a meaning.
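
A dictionary-backed lemmatizer can be pictured as a lookup keyed by the word and its part of speech. The sketch below uses a tiny hand-written table; `LEMMA_TABLE` and `lookup_lemma` are hypothetical names, and a real lemmatizer relies on a full lexicon plus morphological rules.

```python
# Hypothetical mini-lexicon mapping (word, part of speech) to its lemma
LEMMA_TABLE = {
    ("ate", "v"): "eat",
    ("eating", "v"): "eat",
    ("eats", "v"): "eat",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def lookup_lemma(word, pos="n"):
    """Return the lemma from the table, or the word unchanged if it is not listed."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(lookup_lemma("ate", pos="v"))  # eat
print(lookup_lemma("mice"))          # mouse
print(lookup_lemma("ate"))           # ate (no noun entry, so unchanged)
```

The last call shows why the part-of-speech tag matters: with the wrong tag, the lookup finds nothing and the word is returned as-is.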

In [None]:
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()

In [None]:
'''
pos :-
noun - n
verb - v
adjective - a
adverb - r
'''
for word in words:
    print(word + '----->'+lemma.lemmatize(word,pos='v'))

In [None]:
stemming.stem('accordance')

In [None]:
lemma.lemmatize('accordance')

In [None]:
print(corpus)

In [None]:
from nltk.stem import PorterStemmer

sentences = nltk.sent_tokenize(corpus)
print(sentences)
stemmer = PorterStemmer()
# Build the stop word set once instead of re-reading the list for every word
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Drop stop words, then stem what remains
    words = [stemmer.stem(word) for word in words
            if word.lower() not in stop_words]
    sentences[i] = ' '.join(words)
print(sentences)

In [None]:
## Corpus ---> Tokenization ---> Stopwords Removal ---> Stemming  ----> Text to Vector

## Bag of Words (BoW) model

The Bag of Words (BoW) model is the simplest way to represent text as numbers. As the name suggests, a sentence is represented as a "bag" of its words: a vector of numbers that records which words occur, ignoring their order.

Consider the following three movie reviews:

**Review 1:** This movie is very scary and long.<br>
**Review 2:** This movie is not scary and is slow.<br>
**Review 3:** This movie is spooky and good.<br>

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,  ‘slow’, ‘spooky’,  ‘good’.

We can now take each of these words and count how many times it occurs in each of the three movie reviews above. This gives us 3 vectors for 3 reviews:

<img src="https://i.imgur.com/Q8e9CXG.png" width="500" height="600" class="center">

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

And that’s the core idea behind a Bag of Words (BoW) model.
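
The construction above can be reproduced in a few lines of plain Python. This is a minimal sketch that splits on whitespace and keeps the original capitalization; the variable names are chosen for this example only.

```python
reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Build the vocabulary in order of first appearance
vocab = []
for review in reviews:
    for word in review.split():
        if word not in vocab:
            vocab.append(word)

# Count how often each vocabulary word occurs in each review
vectors = [[review.split().count(word) for word in vocab] for review in reviews]

print(vocab)
for vec in vectors:
    print(vec)
```

Each printed vector has one entry per vocabulary word, so all three reviews end up with the same length regardless of how many words they contain.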

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
paragraph = '''This movie is very scary and long.
This movie is not scary and is slow.
This movie is spooky and good.'''
ps = PorterStemmer()
# Build the stop word set once and reuse it for every review
stop_words = set(stopwords.words('english'))
sentences = nltk.sent_tokenize(paragraph)
corpus = []
for i in range(len(sentences)):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    # Remove stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
print(corpus)

In [None]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
print(X)

In [None]:
'This is one sentence'.split()

In [None]:
a = ['This', 'is', 'one', 'sentence']
a

In [None]:
' '.join(a)

In [None]:
import pandas as pd
df = pd.read_csv(r"C:\Users\piyus\Desktop\techis-ds-wiki-main\DS\Step 3-1 NLP\01_NLP\FakeNews.csv")
df

In [None]:
df = df[:1000]
df

In [None]:
df.info()

In [None]:
df.dropna(inplace=True)

In [None]:
df.isna().sum()

In [None]:
df

In [None]:
df = df.reset_index(drop=True)
df

In [None]:
df.label.value_counts()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x = df.label)
plt.title('Label Distribution')

In [None]:
X = df['text']
X.head()

In [None]:
y = df.label
y.head()

In [None]:
messages = df.copy()
messages.reset_index(inplace=True,drop=True)
messages.head()

In [None]:
messages['text'][0]

In [None]:
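import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

porter = PorterStemmer()
# Build the stop word set once; re-reading the list for every word is very slow
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(len(messages)):
    # Keep letters only, lowercase, and split into words
    news = re.sub('[^a-zA-Z]',' ',messages['text'][i])
    news = news.lower()
    news = news.split()
    # Remove stop words and stem the remaining words
    news = [porter.stem(word) for word in news if word not in stop_words]
    news = ' '.join(news)
    corpus.append(news)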

In [None]:
corpus[:1]

The `CountVectorizer` is a tool used in natural language processing (NLP) for converting a collection of text documents into a matrix of token counts. It creates a sparse matrix where rows represent documents, columns represent features (words or n-grams), and the values represent the frequency of each feature in each document. This matrix can then be used as input for machine learning algorithms.

- `max_features`: This parameter specifies the maximum number of features (i.e., words or n-grams) that will be considered. In this case, it's set to 5000, meaning only the top 5000 most frequent words or n-grams will be considered as features.

- `ngram_range`: This parameter specifies the range of n-grams to be extracted. An n-gram is a contiguous sequence of n items from a given sample of text or speech. Here, `ngram_range=(1,3)` indicates that unigrams (single words), bigrams (pairs of consecutive words), and trigrams (triplets of consecutive words) will be considered as features.
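
To illustrate what unigrams, bigrams, and trigrams look like, here is a small plain-Python sketch. The helper name `ngrams` is an assumption for this example and is not part of scikit-learn.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is scary".split()
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
# 1 ['this', 'movie', 'is', 'scary']
# 2 ['this movie', 'movie is', 'is scary']
# 3 ['this movie is', 'movie is scary']
```

With `ngram_range=(1,3)`, all three of these lists become candidate features, which is why the feature count grows quickly and a cap such as `max_features` is useful.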

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 5000,ngram_range=(1,3))
X_cv = cv.fit_transform(corpus).toarray()

In [None]:
X_cv

In [None]:
df_cv = pd.concat([df,pd.DataFrame(X_cv)],axis=1)
df_cv

In [None]:
df_cv.drop(['id', 'title', 'author', 'text'],axis=1,inplace=True)

In [None]:
df_cv

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df_cv.drop(['label'],axis=1),
                                                df_cv['label'],test_size=0.33,random_state = 42)

In [None]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB(alpha=0.01)
model.fit(X_train,y_train)

In [None]:
y_pred = model.predict(X_test)
y_pred

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
print(classification_report(y_test,y_pred))

In [None]:
cm = confusion_matrix(y_test,y_pred)
cm

In [None]:
sns.heatmap(cm,annot=True,fmt='d')

In [None]:
print(accuracy_score(y_test,y_pred))

# TF-IDF
TF-IDF (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics: how often the word appears in the document, and the inverse document frequency of the word across the collection. In the formulas below, each sentence is treated as one document.

Term Frequency (TF) = (Number of times the word appears in the sentence) / (Number of words in the sentence)

Inverse Document Frequency (IDF) = log((Number of sentences) / (Number of sentences containing the word))
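
The two formulas can be worked through by hand on a toy example. The sketch below follows them literally, using the natural logarithm; the three short documents and the function names are invented for illustration.

```python
import math

# Three tiny tokenized documents (illustrative data only)
docs = [
    "good boy".split(),
    "good girl".split(),
    "boy girl good".split(),
]

def tf(word, doc):
    """Share of the document's words that are this word."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of (number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("good", docs[0], docs))           # 0.0, "good" appears in every document
print(round(tf_idf("boy", docs[0], docs), 4))  # 0.2027
```

A word that occurs everywhere gets a score of zero, while a rarer word is weighted up. This `idf` assumes the word occurs in at least one document. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from this hand calculation.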

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()
print(X)

In [None]:
X[0][:200]

In [None]:
df_tf = pd.concat([df,pd.DataFrame(X)],axis=1)
df_tf

In [None]:
df_tf.drop(['id','title','author','text'],axis =1,inplace = True)

In [None]:
df_tf

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df_tf.drop(['label'],axis=1),
                                                df_tf['label'],test_size=0.3,random_state = 42)

In [None]:
from sklearn.naive_bayes import MultinomialNB
nv = MultinomialNB()
nv.fit(X_train,y_train)

In [None]:
y_pred = nv.predict(X_test)
y_pred

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test,y_pred))

In [None]:
cm = confusion_matrix(y_test,y_pred)
cm

In [None]:
import seaborn as sns
sns.heatmap(cm,annot=True,fmt='d')

In [None]:
from sklearn.naive_bayes import GaussianNB
nv = GaussianNB()
nv.fit(X_train,y_train)

In [None]:
y_pred = nv.predict(X_test)
y_pred

In [None]:
cm = confusion_matrix(y_test,y_pred)
cm

In [None]:
sns.heatmap(cm,annot=True,fmt='d')

In [None]:
print(classification_report(y_test,y_pred))

Naive Bayes is a family of probabilistic classification algorithms that are based on Bayes' theorem with the "naive" assumption of independence between features. There are several variants of the Naive Bayes algorithm, each with its own set of assumptions and characteristics. The most common types of Naive Bayes classifiers include:

1. Gaussian Naive Bayes:
   - Assumes that the continuous features follow a Gaussian (normal) distribution.
   - Suitable for continuous data where the values of features are real numbers.

2. Multinomial Naive Bayes:
   - Primarily used for text classification and dealing with discrete data, such as text data where each feature represents the frequency of a term (word) in a document.
   - Assumes that features follow a multinomial distribution.
   - Commonly used in natural language processing tasks like spam detection and document categorization.

3. Bernoulli Naive Bayes:
   - Used for binary or Boolean feature data, where each feature represents the presence or absence of a particular attribute.
   - Assumes that features follow a Bernoulli distribution.
   - Often used in text classification tasks where the presence or absence of words in a document is important (e.g., sentiment analysis).

4. Complement Naive Bayes:
   - An extension of Multinomial Naive Bayes, designed to address class imbalance in text classification problems.
   - It is particularly useful when some classes have very few training examples compared to others.

5. Categorical Naive Bayes:
   - Suitable for categorical data where features represent categories or labels.
   - Assumes that features follow a categorical distribution.
   - Used when dealing with data that doesn't have a natural ordering, such as nominal categorical variables.

6. Hybrid Naive Bayes:
   - In some cases, you may encounter hybrid versions that combine different Naive Bayes models to handle more complex data. These hybrids may mix Gaussian, Multinomial, or other Naive Bayes models as needed for the specific problem.

The choice of which Naive Bayes variant to use depends on the nature of your data and the problem you're trying to solve. It's essential to select the model that best aligns with the assumptions and characteristics of your dataset, as this can significantly impact the classifier's performance.
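
To show how the "naive" independence assumption is used in practice, here is a minimal from-scratch sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. The training sentences and the `fit` and `predict` helpers are invented for this example; it is not how scikit-learn implements `MultinomialNB`, only an illustration of the idea.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-made training set (illustrative data only)
train = [
    ("win money now", "spam"),
    ("win a free prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def fit(samples, alpha=1.0):
    """Collect class frequencies and per-class word counts."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha

def predict(model, text):
    """Return the class with the highest log posterior."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed likelihood of the word given the class
            likelihood = (word_counts[label][word] + alpha) / (
                total_words + alpha * len(vocab)
            )
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "win free money"))    # spam
print(predict(model, "meeting tomorrow"))  # ham
```

Summing log probabilities word by word is exactly where the independence assumption enters: each word contributes to the score separately, regardless of the words around it.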