# Text Pre-processing

In this notebook, we'll dive into **text pre-processing** in natural language processing. Text pre-processing is a critical step in preparing raw text data for analysis and is essential for many NLP tasks.

##### Why Text Cleaning?

Before we can perform any analysis or modeling, it's important to **clean the raw text**. Text cleaning involves transforming the unstructured, messy text data into a structured format that is easier to analyze. For example, we might need to remove **punctuation**, **stopwords** (like "and", "the"), and even convert words to **lowercase** to ensure consistency. The goal is to reduce noise and standardize the text to improve the quality of downstream analysis.
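
To make the "noise" concrete, here is a minimal sketch using only the standard library. The sample sentence and the variable names are made up for illustration; the point is simply that the same three words show up as many distinct tokens until case and punctuation are normalized.

```python
from collections import Counter
import string

# Made-up sample: the same three words written with different case and punctuation
raw = "The cat sat. the CAT sat! The cat, sat?"

# Counting whitespace-separated tokens on the raw text
raw_counts = Counter(raw.split())

# Lowercase and strip punctuation, then count again
cleaned = raw.lower().translate(str.maketrans("", "", string.punctuation))
clean_counts = Counter(cleaned.split())

print(len(raw_counts), "distinct tokens before cleaning")
print(len(clean_counts), "distinct tokens after cleaning")
print(clean_counts)
```

After cleaning, the vocabulary collapses to three tokens, which is what makes downstream frequency counts meaningful.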


### 1. Basic concepts
Text cleaning is an essential preprocessing step for working with textual data. It involves a series of techniques to remove unwanted characters, standardize the text, and simplify the content. Below are some common text cleaning steps and an example using Python.

**Various Forms of Text Cleaning**

- **Lowercasing**: Converts all text to lowercase to ensure uniformity. This helps avoid treating words like "Python" and "python" as different.
- **Removing Punctuation**: Eliminates punctuation marks such as periods, commas, and exclamation points, which often do not add meaningful value for analysis.
- **Removing Numbers**: Removes numerical values if they are not relevant to the analysis, reducing noise in the dataset.
- **Tokenization**: Splits the text into individual units or tokens, typically words, which makes it easier to analyze their frequency and context.
- **Removing Stopwords**: Removes common words like "is", "and", "the" that do not carry significant meaning for many tasks.
- **Lemmatization**: Converts words to their base or dictionary form so that variants of a word are treated as the same token (e.g., the verb "running" becomes "run").
- **Stemming**: Similar in goal to lemmatization, but cruder. A stemmer strips word endings with heuristic rules rather than looking words up in a dictionary, so the result is sometimes not a real word (e.g., the Porter stemmer reduces "studies" to "studi").
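
To illustrate why stemming can produce non-dictionary forms, here is a deliberately naive sketch. `toy_stem` and `SUFFIXES` are hypothetical names invented for this example; this is not the Porter algorithm or any NLTK stemmer, just a rough picture of rule-based suffix stripping.

```python
# Toy suffix stripper for illustration only (NOT the Porter algorithm).
# Suffixes are checked in order; the first one that fits is removed.
SUFFIXES = ("ing", "ies", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["running", "jumped", "cats", "flies", "is"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['runn', 'jump', 'cat', 'fli', 'is']
```

Note that "runn" and "fli" are not real words. A lemmatizer avoids this by looking words up in a vocabulary, at the cost of needing that vocabulary and, ideally, part-of-speech information.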

##### 1.1 The core concepts

In [None]:
# Install nltk if not installed yet
%pip install -U nltk

In [None]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the stopword list and the WordNet data used by the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text
txt = "Hello there! How are you doing today? The weather is gloomy and cold, but the Data Science track is awesome."

# Convert text to lowercase
cleaned_text = txt.lower()

# Remove punctuation
cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))

# Remove numbers
cleaned_text = re.sub(r'\d+', '', cleaned_text)

# Tokenize text
tokens = cleaned_text.split()

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

# Lemmatize tokens (without a POS tag, WordNetLemmatizer treats each token as a noun)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# Final cleaned text
print(lemmatized_tokens)

##### 1.2 Regular Expressions (regex)

**Regular expressions (regex)** are powerful tools for text processing and cleaning. Regex provides a concise way to search, match, and manipulate strings based on specific patterns. Regex can be used to find particular types of text such as dates, emails, or to clean unwanted characters.

##### Common Regex Patterns
- **Digits (`\d`)**: Matches any digit.
- **Word Characters (`\w`)**: Matches any word character (letters, digits, and underscores).
- **Whitespace (`\s`)**: Matches any whitespace character (spaces, tabs, line breaks).
- **Quantifiers (`+`, `*`, `{n}`)**: Specify the number of occurrences (e.g., `\d+` matches one or more digits).
- **Character Set (`[abc]`)**: Matches any character in the set (e.g., `[a-z]` matches any lowercase letter).
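
The patterns above can be tried out on a small string. This is a minimal sketch; the sample sentence and the variable names are made up for illustration.

```python
import re

# Made-up sample text for illustration
sample = "Order 42 shipped on 2024-01-15 to bob_smith; total: 3 items."

digit_runs = re.findall(r"\d+", sample)               # \d with +: one or more digits
word_runs = re.findall(r"\w+", sample)                # \w: letters, digits, underscores
pieces = re.split(r"\s+", sample)                     # \s: split on runs of whitespace
iso_dates = re.findall(r"\d{4}-\d{2}-\d{2}", sample)  # {n}: exact repetition counts
lower_runs = re.findall(r"[a-z]+", sample)            # [a-z]: a character set with a range

print(digit_runs)   # ['42', '2024', '01', '15', '3']
print(iso_dates)    # ['2024-01-15']
print(word_runs)
print(pieces)
print(lower_runs)
```

Notice that `\w+` keeps `bob_smith` together because the underscore counts as a word character, while `[a-z]+` splits it and also drops the capital "O" of "Order".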


In [None]:
import re

# Sample text with different types of noise
txt = "Contact me at email@example.com or call 123-456-7890. Visit https://example.com for more info! :)"

# Extract phone numbers
phone_numbers = re.findall(r'\d{3}-\d{3}-\d{4}', txt)
print("Phone Numbers:", phone_numbers)

# Extract email addresses
email_addresses = re.findall(r'\S+@\S+', txt)
print("Email Addresses:", email_addresses)

# Remove special characters, keeping word characters, whitespace, and hyphens
cleaned_text = re.sub(r'[^\w\s-]', '', txt)

# Final cleaned text
print(txt)
print(cleaned_text)

### 2. Working with a DataFrame
In most cases you will load the data into a pandas DataFrame. Let's take a look at the IMDb movie dataset, which contains more than 9,000 movies along with their plots.

In [None]:
import pandas as pd

# Load the data
df = pd.read_csv('movies.csv')

# Create a new DataFrame with Title and Plot of the first 10 movies
# (.copy() avoids a SettingWithCopyWarning when we add a column later)
df_text = df[['Title', 'Plot']].head(10).copy()

df_text

In [None]:
# Text preprocessing function
def preprocess_text(text):
    # Converting to lowercase
    text = text.lower()
    
    # Removing punctuation and non-word characters
    text = re.sub(r'\W+', ' ', text)
    
    # Removing numbers
    text = re.sub(r'\d+', '', text)

    # Tokenize text
    tokens = text.split()

    # Remove stopwords (a set gives fast membership tests)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # You could add further steps here, such as lemmatization or custom stopwords
    # ...

    # Join tokens back into a single string
    return ' '.join(tokens)


# Preprocessing the Plots
df_text['Cleaned_text'] = df_text['Plot'].apply(preprocess_text)

# Examine results
df_text[['Plot','Cleaned_text']]


In [None]:
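# Now let's scale up the text pre-processing
# (.copy() avoids a SettingWithCopyWarning when we modify columns below)
df_text = df[['Title', 'Plot']].head(250).copy()

# Replace missing plots with empty strings so that text.lower() does not fail on NaN
df_text['Plot'] = df_text['Plot'].fillna("")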

# Preprocessing the Plots
df_text['Cleaned_text'] = df_text['Plot'].apply(preprocess_text)

# Save for other notebooks
df_text.to_csv('movies_cleaned.csv')


### 3. Translate to the Case
Now go to the case and apply the same pre-processing steps to the news articles.
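
One possible starting point is sketched below, assuming the news articles are available as a list of strings. The `articles` list, the `preprocess_article` helper, and the tiny `STOP_WORDS` set are all hypothetical placeholders; in the case you would use your own data and, for example, NLTK's English stopword list.

```python
import re

# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "of", "to", "on"}


def preprocess_article(text: str) -> str:
    """Lowercase, strip non-word characters and digits, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\W+", " ", text)   # punctuation and symbols become spaces
    text = re.sub(r"\d+", "", text)    # remove numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


# Hypothetical example articles, not taken from the case data
articles = [
    "The central bank raised rates by 25 basis points on Tuesday.",
    "A new stadium is planned in the city centre!",
]

cleaned_articles = [preprocess_article(a) for a in articles]
print(cleaned_articles)
```

Whether to drop numbers or keep them is a judgment call for news text, since figures such as "25 basis points" may carry meaning for your analysis.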