<a href="https://colab.research.google.com/github/mohammadreza-mohammadi94/Data_Analysis_Machine_Learning/blob/master/Machine_Learning_Course/MLCourse_V330_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

## Libs

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Dataset

In [2]:
# quoting=3 is csv.QUOTE_NONE: double quotes inside reviews are read as plain text
df = pd.read_csv("Restaurant_Reviews.tsv", delimiter="\t", quoting=3)
df

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


## Cleaning Texts

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []

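for i in range(0, 1000):
    # Keep letters only; every other character becomes a space
    review = re.sub('[^a-zA-Z]', ' ', df['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
    review = ' '.join(review)
    corpus.append(review)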

## Creating Bag Of Words (BoW)

Before a Bag of Words can be built, the raw reviews have to be cleaned. The snippet below is that text preprocessing step; the rest of this section explains each library, method, and step it uses.

```python
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []

for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', df['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
    review = ' '.join(review)
    corpus.append(review)
```

### Explanation of Each Step:

#### Importing Libraries:
1. **`import re`**:
   - The `re` library provides support for regular expressions, which are used for searching and manipulating strings based on patterns.

2. **`import nltk`**:
   - The `nltk` library (Natural Language Toolkit) is a Python library for working with human language data. It provides interfaces to over 50 corpora and lexical resources, along with a suite of text processing modules for tasks such as tokenization, stemming, and tagging.

3. **`nltk.download('stopwords')`**:
   - This command downloads NLTK's stopwords corpus to the local machine. Stopwords are common words (like "and", "the", "is") that are often removed from text because they carry little meaning on their own.

4. **`from nltk.corpus import stopwords`**:
   - This imports the stopwords corpus from NLTK, which contains a list of common stopwords for various languages.

5. **`from nltk.stem.porter import PorterStemmer`**:
   - This imports the Porter Stemmer from NLTK, an algorithm that strips suffixes to reduce words to a common stem (e.g., "running" to "run"). The stem is not always a dictionary word.
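
To make the stemming behaviour concrete, here is a minimal sketch that runs `PorterStemmer` on a few words. The word list is made up for illustration and is not taken from the notebook; the stemmer itself needs no corpus download.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Illustrative words chosen for this sketch
words = ["running", "loved", "tasty"]
stems = [ps.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Different inflections of a word end up on the same stem, which is what shrinks the vocabulary later on. Some stems, such as the one produced for "tasty", are not real English words; that is expected and harmless as long as the same stemmer is applied everywhere.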

#### Initializing the Corpus:
- **`corpus = []`**:
  - This initializes an empty list called `corpus` which will store the cleaned and processed text data.

#### Processing Each Review:
- **`for i in range(0, 1000):`**:
  - This loop iterates over row indices 0 through 999, one per review. The upper bound is hard-coded; using `len(df)` instead keeps the loop correct if the number of reviews changes.

#### Cleaning the Text:
1. **`review = re.sub('[^a-zA-Z]', ' ', df['Review'][i])`**:
   - The `re.sub` function replaces every non-alphabetic character in the review with a space. This removes punctuation, numbers, and special characters, leaving only letters. Substituting a space rather than an empty string keeps words that were separated only by punctuation from being merged.

2. **`review = review.lower()`**:
   - Converts all characters in the review to lowercase to ensure uniformity and avoid treating the same word with different cases as different words.

3. **`review = review.split()`**:
   - Splits the review into individual words based on spaces, creating a list of words.
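
The three cleaning steps can be traced on a single sentence. This is a small sketch using only the standard library; the variable names are chosen for this illustration.

```python
import re

# Sample sentence in the style of the reviews
raw = "Wow... Loved this place."

letters_only = re.sub('[^a-zA-Z]', ' ', raw)  # punctuation becomes spaces
lowered = letters_only.lower()
tokens = lowered.split()  # split() with no argument collapses runs of whitespace

print(tokens)
```

Because `split()` is called without an argument, the extra spaces left behind by the removed dots do not produce empty tokens.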

#### Stemming and Removing Stopwords:
1. **`ps = PorterStemmer()`**:
   - Creates an instance of the PorterStemmer. This stemmer will be used to reduce words to their base or root form.

2. **`all_stopwords = stopwords.words('english')`**:
   - Retrieves the list of English stopwords from NLTK.

3. **`all_stopwords.remove('not')`**:
   - Removes the word 'not' from the stopwords list to retain its presence in the text since it can affect the sentiment and meaning of a sentence.

4. **`review = [ps.stem(word) for word in review if not word in set(all_stopwords)]`**:
   - This list comprehension drops stopwords and stems the words that remain.
   - **`if not word in set(all_stopwords)`**: Keeps a word only if it is not a stopword. The check runs on the unstemmed word. Because the condition is evaluated once per word, `set(all_stopwords)` is rebuilt each time; building the set once before the loop gives the same result with less work.
   - **`ps.stem(word)`**: Stems each word that passes the filter.
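
The effect of keeping 'not' can be seen on one tokenized review. The sketch below uses a short stand-in stopword list so that it runs without downloading anything; the list NLTK returns from `stopwords.words('english')` is much longer.

```python
# Stand-in stopword list for illustration only
sample_stopwords = ["is", "the", "this", "was", "and", "not"]

tokens = ["crust", "is", "not", "good"]

# With 'not' still treated as a stopword, the negation disappears
full_set = set(sample_stopwords)
negation_lost = [w for w in tokens if w not in full_set]

# After taking 'not' out of the stopword set, the negation survives
kept_set = full_set - {"not"}  # built once, outside the comprehension
negation_kept = [w for w in tokens if w not in kept_set]

print(negation_lost)
print(negation_kept)
```

In the first result a negative review reads like a positive one, which is why the notebook removes 'not' from the stopword list before filtering.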

#### Joining the Processed Words:
- **`review = ' '.join(review)`**:
  - Joins the list of processed words back into a single string with spaces separating the words.

#### Adding to the Corpus:
- **`corpus.append(review)`**:
  - Appends the cleaned and processed review to the `corpus` list.

### Summary of the Process:

1. **Data Cleaning**:
   - Remove non-alphabetic characters.
   - Convert text to lowercase.
   - Split text into words.

2. **Stemming and Stopword Removal**:
   - Stem words to their base form.
   - Remove common stopwords except for 'not'.

3. **Reconstructing Text**:
   - Join the processed words into a single string.
   - Store the processed string in the corpus.
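
The same three stages can also be wrapped in a helper function. This is one possible refactoring, not the notebook's own code: `clean_review` is a hypothetical name, and the stopword set is a small stand-in for `stopwords.words('english')` with 'not' removed.

```python
import re
from nltk.stem.porter import PorterStemmer


def clean_review(text, stemmer, stop_words):
    """Apply the cleaning, stopword removal, and stemming steps to one review."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(stemmer.stem(w) for w in words if w not in stop_words)


ps = PorterStemmer()  # created once and reused for every review

# Stand-in stopword set for illustration; 'not' is deliberately absent
stop_words = {"this", "is", "the", "and", "was"}

reviews = ["Wow... Loved this place.", "Crust is not good."]
corpus = [clean_review(r, ps, stop_words) for r in reviews]
print(corpus)
```

Creating the stemmer and the stopword set once, outside the per-review work, avoids repeating that setup for every review.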

### Reasons for Each Step:

- **Removing Non-Alphabetic Characters**: To focus on meaningful words and reduce noise in the data.
- **Lowercasing**: To ensure consistency and prevent treating the same word with different cases as different words.
- **Splitting**: To facilitate word-level processing such as stemming and stopword removal.
- **Stemming**: To reduce words to a common stem, so that inflected forms of a word count as one feature and the vocabulary the model has to learn is smaller.
- **Stopword Removal**: To eliminate common words that contribute little meaning, which simplifies the text and often helps model performance. 'not' is kept because it reverses the sentiment of the words around it.
- **Rejoining Words**: To convert the processed list of words back into a format suitable for further NLP tasks like vectorization or model training.

By following these steps, the text data is transformed into a cleaner and more manageable form, making it more suitable for subsequent NLP tasks such as sentiment analysis or classification.
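
As an illustration of where the cleaned corpus goes next, here is a minimal Bag of Words sketch in plain Python. The notebook does not show its vectorization code, so this is only a sketch of the idea on a three-review corpus; in practice a library class such as scikit-learn's `CountVectorizer` is commonly used to build the same kind of count matrix.

```python
from collections import Counter

# Tiny cleaned corpus for illustration, in the same form as the `corpus` entries
corpus = ["wow love place", "crust not good", "not tasti textur nasti"]

# Vocabulary: every distinct token, sorted for a stable column order
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# One row per review, one column per vocabulary word, values are word counts
bow = []
for doc in corpus:
    counts = Counter(doc.split())
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

Each review becomes a fixed-length vector of counts, and word order is discarded. That is also why retaining 'not' during cleaning matters: it is the only trace of negation left in the vector.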