<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_012_feature_extraction_from_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Feature Extraction from Text


### **Project Summary: Text Classification using TF-IDF with sklearn**
In this project, we’ll explore Natural Language Processing (NLP) fundamentals, focusing on transforming text data into useful features using TF-IDF (Term Frequency-Inverse Document Frequency). We’ll apply these transformations to a dataset (e.g., 20 Newsgroups or IMDb), building and evaluating a text classification model. This project will provide hands-on experience with converting text to numerical data, a crucial step in preparing text for machine learning models.

### **Sklearn Tools We’ll Use**
1. **`TfidfVectorizer`**:
   - This tool will help us convert raw text into a TF-IDF matrix in one step, covering tokenization, counting word occurrences, and weighting words based on their importance in the dataset.
2. **`TfidfTransformer`**:
   - We’ll experiment with this tool to apply TF-IDF weighting to an existing term count matrix (generated by `CountVectorizer`), giving us a deeper understanding of each step in text transformation.
3. **`CountVectorizer`**:
   - Used to generate a basic term-document matrix, it lets us explore the raw counts of terms in each document before applying TF-IDF weighting.
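
To make the relationship between these three tools concrete, here is a minimal sketch on a made-up three-sentence corpus (the `toy_docs` list is illustrative and not part of the project data). With default settings, `CountVectorizer` followed by `TfidfTransformer` should give the same matrix as `TfidfVectorizer` alone.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, made up for illustration only
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Two-step route: raw term counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: tokenize, count, and weight in a single call
one_step = TfidfVectorizer().fit_transform(toy_docs)

print("Count matrix shape:", counts.shape)
print("Same result:", np.allclose(two_step.toarray(), one_step.toarray()))
```

The two-step route is mainly useful when you want to inspect the raw counts before weighting.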

### **Major Lessons and Focus Areas**
1. **Understanding the TF-IDF Process**:
   - Learn how TF and IDF values are calculated and how they represent word importance across documents.
   - Explore how TF-IDF helps improve model performance by emphasizing distinctive words and reducing the noise from common terms.

2. **Converting Text Data to Numerical Features**:
   - Practice using `TfidfVectorizer` and `TfidfTransformer` to transform raw text into a usable feature matrix for machine learning.
   - Compare the output from `CountVectorizer` and `TfidfTransformer` to understand the effect of TF-IDF weighting.

3. **Building and Evaluating a Text Classification Model**:
   - Using the transformed features, we’ll build a basic classification model (e.g., Logistic Regression or Naive Bayes).
   - Evaluate the model’s performance to see how well TF-IDF features help in identifying categories or sentiment within the text.

4. **Exploring Practical Applications**:
   - Learn how TF-IDF can be applied to various NLP tasks beyond classification, like similarity search and topic clustering.

This project will give you hands-on experience with text preprocessing, feature extraction, and building predictive models, foundational skills for any NLP project.

## Dataset Descriptions

### 1. **20 Newsgroups Dataset (sklearn)**
   - **Description**: A classic dataset for text classification, containing around 20,000 newsgroup posts across 20 different categories (e.g., “sci.space,” “rec.sport.hockey,” “comp.graphics”).
   - **Why Use It**: It’s great for experimenting with TF-IDF and other text processing techniques due to its multi-class setup and diverse vocabulary.
   - **Loading the Dataset**:
     ```python
     from sklearn.datasets import fetch_20newsgroups

     # Load the dataset
     newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
     documents = newsgroups.data
     labels = newsgroups.target
     ```

### 2. **IMDb Dataset (Hugging Face)**
   - **Description**: Contains movie reviews along with sentiment labels (positive or negative). It’s widely used for sentiment analysis and other NLP tasks.
   - **Why Use It**: Sentiment analysis is a common application of TF-IDF, and this dataset provides a nice binary classification task.
   - **Loading the Dataset**:
     ```python
     from datasets import load_dataset

     # Load IMDb dataset
     dataset = load_dataset("imdb")
     documents = dataset["train"]["text"]
     labels = dataset["train"]["label"]
     ```

### 3. **AG News Dataset (Hugging Face)**
   - **Description**: A dataset of news articles categorized into four classes (World, Sports, Business, Sci/Tech).
   - **Why Use It**: This dataset allows you to experiment with TF-IDF for multi-class text classification, similar to the Newsgroups dataset but with fewer classes and a focus on news articles.
   - **Loading the Dataset**:
     ```python
     from datasets import load_dataset

     # Load AG News dataset
     dataset = load_dataset("ag_news")
     documents = dataset["train"]["text"]
     labels = dataset["train"]["label"]
     ```

### 4. **SMS Spam Collection Dataset (sklearn)**
   - **Description**: A small dataset of SMS messages labeled as “spam” or “ham” (not spam). It’s useful for experimenting with text classification on shorter text.
   - **Why Use It**: The simplicity and binary nature of this dataset make it perfect for beginners practicing TF-IDF and text classification.
   - **Loading the Dataset**:
     ```python
     from sklearn.datasets import fetch_openml

     # Load SMS spam dataset
     sms = fetch_openml("sms_spam", version=1, as_frame=True)
     documents = sms.data["text"]
     labels = sms.target
     ```

### Suggested Practice
Once you load a dataset, you can:
1. **Apply `CountVectorizer` or `TfidfVectorizer`** to convert the text data into a numerical form.
2. **Train a model** using sklearn’s classifiers like `LogisticRegression`, `MultinomialNB` (Naive Bayes), or `RandomForestClassifier`.
3. **Evaluate the performance** to see how TF-IDF features impact the model’s accuracy.

These datasets are well-suited for experimenting with `TfidfVectorizer` and `TfidfTransformer`, and the text-based structure makes them ideal for NLP tasks like text classification, clustering, and similarity analysis.


In [None]:
# from datasets import load_dataset

# # Load IMDb dataset
# dataset = load_dataset("imdb")
# documents = dataset["train"]["text"]
# labels = dataset["train"]["label"]

In [None]:
# from sklearn.datasets import fetch_openml

# # Load SMS spam dataset
# sms = fetch_openml("sms_spam", version=1, as_frame=True)
# documents = sms.data["text"]
# labels = sms.target




### 1. **Explore the Data**:
   - **Inspect the structure** of the dataset, understanding how documents and labels are organized.
   - Check the **first few entries** to get a sense of the text content and look for any pre-processing needs, like unwanted headers or special characters.

### 2. **Pre-process Text Data**:
   - **Text Cleaning**: Remove any metadata (headers, footers, quotes) if it’s not already done. This can help reduce noise.
   - **Lowercasing**: Convert text to lowercase so that “Machine” and “machine” count as the same term. `TfidfVectorizer` does this by default (`lowercase=True`).
   - **Optional**: Consider removing **stop words** (common words like "the," "is," etc.), although TF-IDF down-weights them naturally.

### 3. **Split Data into Training and Testing Sets**:
   - Since this is a classification task, **splitting** the data into training and testing sets early allows us to assess model performance on unseen data later.
   - Sklearn’s `train_test_split` function is useful here.
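
As a small illustration of the split, the sketch below uses a made-up list of eight labeled snippets (`texts` and `targets` are hypothetical names, not part of the notebook). Passing `stratify` is an optional choice here that keeps the class balance the same in both splits.

```python
from sklearn.model_selection import train_test_split

# Made-up labeled snippets: 1 = positive, 0 = negative
texts = [
    "good movie", "bad movie", "great film", "awful film",
    "loved it", "hated it", "fine acting", "poor acting",
]
targets = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the data; stratify keeps both classes in each split
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, targets, test_size=0.25, random_state=42, stratify=targets
)

print("Training examples:", len(X_tr))
print("Testing examples:", len(X_te))
```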

### 4. **Initialize CountVectorizer or TfidfVectorizer**:
   - **Choose your vectorizer**: If using `TfidfVectorizer`, we can directly transform the text data into a TF-IDF matrix. If you want to understand intermediate steps, start with `CountVectorizer`, then apply `TfidfTransformer`.
   - **Set parameters** like `max_features` (to limit the vocabulary size), `ngram_range` (e.g., (1, 2) for unigrams and bigrams), and `stop_words='english'` (optional) to remove stop words.

### 5. **Transform Text Data into a TF-IDF Matrix**:
   - Once the vectorizer is ready, **fit and transform** the training data to create the TF-IDF feature matrix.
   - Transform the test data using the same vectorizer (fit only on the training data to avoid data leakage).


### Preprocessing Manually
`fetch_20newsgroups` makes preprocessing convenient: its `remove` argument strips headers, footers, and quotes, and `TfidfVectorizer` handles lowercasing and stop word removal directly. This setup streamlines preprocessing.

In most real-world scenarios, however, you would typically need to define your own functions to perform each of these steps. Here’s a quick look at what these custom preprocessing steps might look like:

1. **Lowercasing**:
   ```python
   def to_lowercase(text):
       return text.lower()
   ```

2. **Removing Headers, Footers, Quotes**:
   - Headers and footers can usually be stripped if they follow a pattern (e.g., metadata or signatures in emails). Otherwise, it may require regex or specific removal functions.
   - Quoted text and other unwanted characters are often removed with regular expressions (Python’s `re` module).

3. **Stop Words Removal**:
   - Using NLTK or a custom list to remove common words:
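     ```python
     import nltk
     from nltk.corpus import stopwords

     nltk.download('stopwords')  # one-time download of the stop word lists
     stop_words = set(stopwords.words('english'))
     
     def remove_stopwords(text):
         # NLTK's stop words are lowercase, so compare on the lowercased word
         return " ".join(word for word in text.split() if word.lower() not in stop_words)
     ```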

4. **Combined Preprocessing Function**:
   - A single function can incorporate all steps for consistency:
     ```python
     import re
     
     def preprocess_text(text):
         # Lowercase
         text = text.lower()
         # Remove special characters
         text = re.sub(r'[^a-zA-Z\s]', '', text)
         # Remove stop words
         text = " ".join(word for word in text.split() if word not in stop_words)
         return text
     ```
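
Step 2 in the list above has no code, so here is one possible sketch for email-style text. It assumes two common conventions that may not hold for every dataset: quoted reply lines start with `>`, and a signature block starts with a line containing only `--`. The function name `strip_quotes_and_signature` is hypothetical.

```python
import re

def strip_quotes_and_signature(text):
    """Drop quoted reply lines and a trailing signature block (heuristic)."""
    kept = []
    for line in text.splitlines():
        # A line that is only "--" conventionally starts a signature
        if line.rstrip() == "--":
            break
        # Lines starting with ">" are usually quoted from an earlier message
        if re.match(r"\s*>", line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()

raw = (
    "> What engine does it use?\n"
    "I think it was a V8.\n"
    "-- \n"
    "Jane Doe\n"
    "jane@example.com"
)
print(strip_quotes_and_signature(raw))
```

Heuristics like this are dataset-specific, so it is worth checking a few cleaned samples by eye before applying them to a whole corpus.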



## Load Data: Newsgroups

In [4]:
from sklearn.datasets import fetch_20newsgroups

# Load the dataset
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
documents = newsgroups.data
labels = newsgroups.target

# Check the number of documents and unique labels
print("Number of documents:", len(documents))
print("Number of unique labels:", len(set(labels)))

# Preview a few sample documents
for i in range(3):
    print(f"Document {i+1}:\n{documents[i]}\n{'-'*80}")


Number of documents: 11314
Number of unique labels: 20
Document 1:
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
--------------------------------------------------------------------------------
Document 2:
A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requeste

Loading both the training and testing subsets of the Newsgroups dataset and keeping the provided split is a good approach. Using the predefined train-test split avoids unintended data leakage. Here’s how we can set it up:

### Step 1: Download Both Train and Test Sets
We’ll load both the `train` and `test` subsets, and then set up `X_train`, `y_train`, `X_test`, and `y_test` directly from these splits.

### Step 2: Inspect the Data
Now, let’s check the structure of `X_train` and `X_test` as well as a few samples. This will give us an idea of any additional preprocessing needed.

This setup allows us to proceed with preprocessing, vectorization, and model training without having to split the data manually.

In [7]:
from sklearn.datasets import fetch_20newsgroups

# Load both training and testing data
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# Define training and testing data and labels
X_train, y_train = newsgroups_train.data, newsgroups_train.target
X_test, y_test = newsgroups_test.data, newsgroups_test.target

# Print basic information about the datasets
print("Number of training documents:", len(X_train))
print("Number of testing documents:", len(X_test))
print("Number of unique labels:", len(set(y_train)))

Number of training documents: 11314
Number of testing documents: 7532
Number of unique labels: 20


Now that we have our training and testing sets (`X_train`, `y_train`, `X_test`, `y_test`), the next steps involve text preprocessing and vectorization. Here’s what we’ll do next:

### Step 3: Text Preprocessing (Optional)
Based on the document preview above, we can decide on any additional preprocessing:
- **Lowercasing**: This step is often helpful to treat words like "Machine" and "machine" as the same term.
- **Removing Stop Words**: We can remove common words like "the," "is," and "and" using the `stop_words` parameter in `TfidfVectorizer`.
- **Other Cleaning**: If we notice any additional noise or special characters, we can handle that as well.

Let’s proceed with lowercasing and removing stop words to create a cleaner text representation.

### Step 4: Initialize the TfidfVectorizer
Using `TfidfVectorizer`, we’ll transform the text data into a TF-IDF matrix. This matrix will represent each document as a set of weighted word features based on their frequency and importance.


### Step 5: Check the Shape and Sample of the TF-IDF Matrix
It’s helpful to inspect the resulting TF-IDF matrix briefly, which gives us an idea of the structure and confirms that our transformation was successful. At this point, we have a clean and ready-to-use TF-IDF representation of our text data.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer with basic parameters
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', max_features=5000)  # max_features limits vocab size

# Fit and transform the training data to create the TF-IDF matrix
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data using the same vectorizer (important to use the same fit)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [9]:
# Check the shape of the TF-IDF matrices
print("TF-IDF Matrix for X_train:", X_train_tfidf.shape)
print("TF-IDF Matrix for X_test:", X_test_tfidf.shape)

# View a few feature names to understand the vocabulary (first 10 words)
print("Sample feature names:", tfidf_vectorizer.get_feature_names_out()[:10])

TF-IDF Matrix for X_train: (11314, 5000)
TF-IDF Matrix for X_test: (7532, 5000)
Sample feature names: ['00' '000' '01' '02' '03' '04' '040' '05' '06' '07']


### TfidfVectorizer Explained

The `TfidfVectorizer` in sklearn is a powerful tool for converting text data into a numerical representation, specifically TF-IDF (Term Frequency-Inverse Document Frequency) values. It’s commonly used in NLP to transform raw text into features that a machine learning model can use.

Here’s a breakdown of how `TfidfVectorizer` works and how you can use it:

### 1. Term Frequency (TF)
   - TF represents how frequently a term appears in a document.
   - It’s commonly defined as the ratio of the term's occurrences to the total number of terms in the document. sklearn uses the raw count instead and normalizes each document vector afterwards.

### 2. Inverse Document Frequency (IDF)
   - IDF measures how unique or rare a term is across all documents.
   - Rare terms get higher scores, as they’re more informative, while common terms get lower scores.

### 3. TF-IDF Calculation
   - TF-IDF combines TF and IDF: it assigns a weight to each term based on both its frequency in a single document and its rarity across all documents.
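   - The formula is typically: **TF-IDF(term, doc) = TF(term, doc) * IDF(term)**. With its default settings, sklearn computes IDF(term) = ln((1 + n) / (1 + df(term))) + 1, where n is the number of documents and df(term) is the number of documents containing the term, then scales each document vector to unit L2 length.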

### Key Points to Remember:
   - `fit_transform`: Fits the vectorizer on the data and transforms it to a TF-IDF matrix.
   - `get_feature_names_out()`: Gets the vocabulary (unique terms) as features.
   - TF-IDF values are higher for terms that are important within a document but less common across the entire dataset.

This vectorization is useful when building machine learning models, as it converts text into numerical data that captures the importance of each word in a form a model can use.

1. **Rows as Documents**:
   - Each row in the TF-IDF matrix represents a single document.
   - A list of 3 documents produces a matrix with 3 rows; our training matrix above has 11,314 rows, one per training document.

2. **Sparse Matrix**:
   - The matrix is sparse because each document contains only a small fraction of the combined vocabulary. For any word that’s absent from a document, the TF-IDF score is zero.
   - So, you end up with mostly zeros and a few non-zero values where words from the vocabulary actually appear in a document.
   
3. **TF-IDF Scores**:
   - Each non-zero entry in the matrix is the TF-IDF score for a term within a specific document, calculated as **TF(term, doc) * IDF(term)**.
   - These scores give extra weight to terms that are more informative (frequent in a document but rare across the corpus), helping to capture what’s distinctive about each document.

4. **Boosting Model Performance**:
   - By transforming text into TF-IDF scores, we create features that emphasize informative words, which improves the model’s ability to distinguish between documents.
   - These scores make the representation richer and more meaningful for models, allowing them to focus on words that carry weight rather than common, generic terms, ultimately improving predictive power.

Using TF-IDF, models can better understand what’s important in each document and identify patterns related to specific words or themes, which can significantly enhance text classification or similarity-based predictions.

### Improving Predictive Power

Here’s how TF-IDF enhances predictive power by giving higher weights to informative words, and why this richer representation improves model predictions:

### 1. Reducing Noise from Common Words
   - Without TF-IDF, words like “the,” “is,” or “and” (often called *stop words*) might appear frequently across many documents but don’t convey specific meaning related to the content or themes.
   - TF-IDF assigns low weights to these common words, effectively reducing their importance in the model.
   - By down-weighting non-informative words, the model can focus on terms that actually differentiate documents, such as “sklearn,” “technique,” or “processing” in a corpus of machine learning articles.

### 2. Emphasizing Distinctive Words
   - TF-IDF gives higher weights to words that are frequent in a specific document but rare across other documents.
   - These unique words, often specific to the document’s main topic or theme, are then more prominent in the document’s TF-IDF vector representation.
   - For example, if “technique” appears frequently in one document but rarely in others, it will have a high TF-IDF score, signaling to the model that it’s an important feature for that document.

### 3. Enabling Pattern Recognition
   - By using TF-IDF, models can detect patterns between the presence and absence of meaningful words and the document's class or label.
   - For example, in a text classification task, if “machine learning” is highly weighted in documents labeled as “tech,” the model learns that a high score for “machine learning” correlates with the “tech” label.
   - Over time, as the model trains, it adjusts its internal parameters to “notice” these high-weighted words more and use them as clues for classifying documents.

### 4. Improving Predictive Power through Feature Weights
   - When a model sees new documents, the vectorizer applies the IDF weights fitted on the training data, so the same key terms stand out in the new TF-IDF vectors.
   - For instance, if the model learned that “sklearn” or “TF-IDF” are strong indicators of technical content, it will use their high TF-IDF values in new documents to predict a “tech” label.
   - Higher-value words from TF-IDF can thus steer the model’s prediction towards the correct label based on learned word associations, enhancing accuracy.

In short, TF-IDF helps the model by:
- **Filtering out noise** from common terms.
- **Highlighting distinctive terms** that capture the essence of each document.
- **Learning meaningful patterns** based on these distinctive terms, which improves its ability to generalize and make accurate predictions on new data.

So, the TF-IDF values effectively act as signals that guide the model’s learning and decision-making, resulting in improved predictive power, especially in tasks like text classification or similarity search.



### Beyond Binary Classification
If our corpus focuses on documents about machine learning, it would certainly seem like it’s “tech”-biased. But in real applications, there’s a broader scope, and TF-IDF helps the model capture nuanced distinctions across a variety of categories, not just a binary “tech” or “not tech.”

### Here’s how TF-IDF works across diverse topics, making it helpful for a range of tasks:

1. **Handling Multiple Classes in Text Classification**:
   - TF-IDF is effective for multi-class classification, where each document could belong to any one of multiple categories (e.g., “tech,” “sports,” “health,” “finance,” etc.).
   - Each class may have its own distinctive terms. For instance:
     - “machine learning” or “sklearn” might signal “tech.”
     - “fitness” or “diet” might signal “health.”
     - “market” or “stocks” might signal “finance.”
   - By learning TF-IDF patterns for each class, the model can distinguish documents based on vocabulary patterns and classify them into their respective categories rather than just a binary “tech” or “not tech.”

2. **Helping with Sentiment Analysis**:
   - TF-IDF can assist in identifying the sentiment behind a document by capturing words commonly associated with positive or negative sentiment.
   - For instance, in reviews, distinctive terms like “excellent” or “terrible” keep meaningful TF-IDF weights, and a classifier can learn to associate them with positive or negative sentiment.

3. **Improving Document Similarity Search**:
   - TF-IDF is widely used in recommendation systems and search engines to measure similarity between documents.
   - If a user searches for “machine learning techniques,” TF-IDF can help rank documents that are most relevant to this query based on shared high-weight words.
   - This allows users to find content that is closely aligned with their interests or needs, as documents with similar high TF-IDF values are considered more relevant.

4. **Assisting with Topic Modeling**:
   - In unsupervised learning, TF-IDF values can help reveal hidden topics within a set of documents.
   - For example, by clustering documents with high TF-IDF scores for specific terms, we might find topic clusters such as “AI and machine learning,” “health and fitness,” or “investment strategies.”
   - These clusters enable the organization of vast amounts of text data into meaningful groups.
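
As one illustration of the similarity search use case, the sketch below ranks a made-up corpus against a query using cosine similarity between TF-IDF vectors. The `corpus` and `query` strings are illustrative, and cosine similarity is just one common choice of measure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus for illustration
corpus = [
    "machine learning techniques for text classification",
    "healthy diet and fitness tips",
    "stock market investment strategies",
]

search_vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = search_vectorizer.fit_transform(corpus)

# Project the query into the same TF-IDF space as the documents
query = "machine learning techniques"
query_vector = search_vectorizer.transform([query])

# One similarity score per document; higher means more similar
scores = cosine_similarity(query_vector, doc_vectors).ravel()
best = scores.argmax()

print("Scores:", scores)
print("Best match:", corpus[best])
```

Documents that share no terms with the query get a score of zero, which is why TF-IDF search depends on vocabulary overlap.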

### Beyond Simple Binary Classification
So, while a simple “tech” vs. “not tech” classification might seem limited, TF-IDF shines in complex, real-world tasks where there are multiple categories, intricate similarities, or subtle distinctions in the text. It gives models a structured way to leverage the distinct language used across a diverse set of documents, enabling everything from advanced text categorization to precise content recommendations.

Now that we have our preprocessed text data transformed into a TF-IDF matrix, the next step is to build and train a classification model using this matrix as input. Here’s a structured approach:

### Step 6: Build a Classification Model
To start, let’s use a simple model like **Logistic Regression** or **Naive Bayes**. Both are effective and often perform well with text data, especially when using TF-IDF representations.

### Step 7: Evaluate the Model
Now, let’s evaluate the model’s performance on the test set. We’ll use metrics like **accuracy**, **precision**, **recall**, and **F1 score** to understand how well the model is doing.

### Step 8: Analyze the Results
- **Interpret the Classification Report**: Look at precision, recall, and F1 scores for each class to see if there are any categories where the model struggles.
- **Identify Areas for Improvement**: Based on the results, we might try other models, tune hyperparameters, or explore different vectorization options (like adjusting `max_features` or `ngram_range` in `TfidfVectorizer`).
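
One way to find the categories where the model struggles is to request the report as a dictionary and sort by F1 score. The sketch below uses made-up labels and predictions (`y_true_demo`, `y_pred_demo`) and three category names as stand-ins. In the notebook you would likely pass `y_test`, `y_pred`, and `newsgroups_train.target_names` instead.

```python
from sklearn.metrics import classification_report

# Made-up true labels and predictions for three classes
y_true_demo = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 1, 2, 0, 0]
class_names = ["comp.graphics", "rec.autos", "sci.space"]

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(
    y_true_demo, y_pred_demo, target_names=class_names, output_dict=True
)

# Collect the F1 score per class and find the weakest one
f1_by_class = {name: report[name]["f1-score"] for name in class_names}
weakest = min(f1_by_class, key=f1_by_class.get)

for name, f1 in sorted(f1_by_class.items(), key=lambda item: item[1]):
    print(f"{name}: F1 = {f1:.2f}")
print("Weakest class:", weakest)
```

Passing `target_names` also makes the printed report show category names instead of integer labels.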

This should give us a solid starting point for understanding model performance with TF-IDF-transformed text data.

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Initialize the classifier (choose one)
log_reg_model = LogisticRegression(max_iter=1000)
nb_model = MultinomialNB()

# Fit the model on the training data (use either model)
log_reg_model.fit(X_train_tfidf, y_train)
# nb_model.fit(X_train_tfidf, y_train)  # Uncomment this line if using Naive Bayes

### Predict
In this project, we are predicting the **category or topic** of each document in the Newsgroups dataset. The dataset contains text documents grouped into 20 different topics (or classes), and our model’s goal is to classify each document into one of these topics based on its content.

### Categories in the Newsgroups Dataset
The 20 Newsgroups dataset includes a variety of categories such as:
- **Computer-related**: comp.graphics, comp.sys.mac.hardware, comp.windows.x, etc.
- **Recreation**: rec.autos, rec.sport.hockey, rec.sport.baseball, etc.
- **Science**: sci.crypt, sci.electronics, sci.med, sci.space, etc.
- **Politics**: talk.politics.guns, talk.politics.mideast, talk.politics.misc, etc.

These categories cover a range of topics, making it ideal for multi-class text classification.

### Model’s Objective
The model’s task is to **analyze the content of each document** and predict which of the 20 categories it belongs to. For instance:
- A document with many terms like “graphics,” “computer,” and “file formats” might be classified under **comp.graphics**.
- Another with words like “medicine,” “doctor,” and “treatment” might be classified under **sci.med**.

### Example
After training, if we pass in a document from the test set, the model will analyze its TF-IDF representation and predict one of the 20 classes. The result might look something like this:

- **Input Document**: “I’m working on some cool graphics animations using OpenGL.”
- **Predicted Category**: `comp.graphics`

This classification task is common in NLP, where the goal is to categorize text into predefined topics, enabling applications such as topic detection, email filtering, and document organization.

In [12]:
from sklearn.metrics import classification_report, accuracy_score

# Make predictions on the test set
y_pred = log_reg_model.predict(X_test_tfidf)  # or nb_model.predict(X_test_tfidf)

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.44      0.43      0.44       319
           1       0.58      0.65      0.62       389
           2       0.64      0.61      0.62       394
           3       0.63      0.59      0.61       392
           4       0.70      0.64      0.67       385
           5       0.80      0.69      0.74       395
           6       0.75      0.76      0.75       390
           7       0.67      0.66      0.66       396
           8       0.45      0.75      0.56       398
           9       0.77      0.75      0.76       397
          10       0.87      0.84      0.86       399
          11       0.86      0.65      0.74       396
          12       0.50      0.57      0.54       393
          13       0.71      0.72      0.71       396
          14       0.69      0.69      0.69       394
          15       0.63      0.75      0.68       398
          16       0.53      0.63      0.58       364
   

## IMDb Dataset Pipeline
Creating a pipeline will streamline the preprocessing, vectorization, and modeling steps, making it reusable for other datasets. Here’s how we can set up a pipeline in sklearn to handle these steps consistently across datasets, starting with the IMDb dataset.

### Step 1: Load the IMDb Dataset
Let’s load the IMDb dataset and set up the features (`documents`) and target labels (`labels`).

### Step 2: Create an sklearn Pipeline
An sklearn pipeline allows us to chain together preprocessing, vectorization, and modeling steps. This makes it easy to apply the same process to both training and test data and even switch datasets with minimal modification.

### Step 3: Evaluate the Pipeline
With the pipeline set up, we can easily apply it to new data for prediction and evaluation.

### Step 4: Reusing the Pipeline for Other Datasets
This pipeline is now reusable. To apply it to another dataset (e.g., AG News), simply load the new data, split it, and call `pipeline.fit()` and `pipeline.predict()` on the new dataset.

### Benefits of the Pipeline
- **Consistency**: Each dataset will go through the same steps, ensuring consistency in preprocessing, vectorization, and modeling.
- **Reusability**: You can apply this pipeline to multiple datasets with minimal code adjustments.
- **Modularity**: Any component in the pipeline (e.g., `TfidfVectorizer` or `LogisticRegression`) can be swapped out easily, making it adaptable for experimentation.

This pipeline is a flexible way to standardize the text classification process across different datasets.
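
A minimal sketch of such a pipeline is shown here on a handful of made-up reviews so it runs without downloading anything. The step names `"tfidf"` and `"clf"`, the `demo_pipeline` variable, and the toy data are assumptions for illustration, not the notebook's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up reviews: 1 = positive, 0 = negative
train_reviews = [
    "an excellent and heartfelt film",
    "excellent acting and a great story",
    "a terrible and boring film",
    "terrible acting and a dull story",
]
train_sentiment = [1, 1, 0, 0]

# Chain vectorization and classification into one estimator
demo_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() fits the vectorizer and the classifier on the training text
demo_pipeline.fit(train_reviews, train_sentiment)

# predict() vectorizes new text with the fitted vocabulary, then classifies
new_reviews = ["an excellent story", "a terrible film"]
print(demo_pipeline.predict(new_reviews))
```

Because the vectorizer lives inside the pipeline, it is fitted only on whatever data is passed to `fit()`, which helps avoid leaking test data into the vocabulary. A step can be swapped by name, for example `demo_pipeline.set_params(clf=MultinomialNB())` after importing `MultinomialNB`.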

## Max Features
Setting `max_features` in `TfidfVectorizer` controls the size of the vocabulary by limiting the number of terms used to represent each document. Here’s why this parameter is useful, especially in NLP tasks like text classification:

### 1. **Reducing Model Complexity**
   - Text datasets often contain thousands or even tens of thousands of unique words.
   - Including all words as features can make the TF-IDF matrix very large and sparse, leading to high-dimensional data, which can be computationally expensive and harder for models to process.
   - Limiting the vocabulary size with `max_features` reduces the dimensionality of the feature space, helping the model train faster without significant loss of accuracy.

### 2. **Focusing on Informative Words**
   - By default, `TfidfVectorizer` ranks terms by their total count across the corpus and keeps the `max_features` most frequent ones.
   - This keeps frequently occurring words and drops less common ones, which are often noisy or too rare to help the model.
   - For instance, in a dataset like IMDb reviews, focusing on the 5,000 most common words is often sufficient to capture the overall sentiment of reviews, while rare words may not add much predictive power.

### 3. **Avoiding Overfitting**
   - Having too many features can lead to overfitting, especially when using smaller datasets.
   - Limiting `max_features` helps generalize the model by focusing on a core set of meaningful terms, avoiding over-reliance on rare, dataset-specific words that may not generalize well to new data.

### Choosing the `max_features` Value
   - Setting `max_features` to a higher value (e.g., 10,000 or 20,000) may capture more nuances in the text but increases computational cost.
   - Lower values (like 1,000–5,000) are often enough for general text classification tasks and keep the model efficient.

In short, `max_features` helps balance performance and efficiency, allowing the model to focus on the most informative words without overburdening the pipeline with unnecessary terms.
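
To make the effect concrete, here is a small sketch on a made-up three-document corpus (the corpus and variable names are assumptions for illustration). It compares the vocabulary learned with and without `max_features`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
toy_docs = [
    "great movie great acting",
    "boring movie weak plot",
    "great plot great cast",
]

full = TfidfVectorizer()                 # keeps every term
capped = TfidfVectorizer(max_features=3)  # keeps the 3 most frequent terms

X_full = full.fit_transform(toy_docs)
X_capped = capped.fit_transform(toy_docs)

print(X_full.shape)                 # (3, 7): three documents, seven distinct terms
print(X_capped.shape)               # (3, 3): same documents, capped vocabulary
print(sorted(capped.vocabulary_))   # ['great', 'movie', 'plot']
```

Here "great" occurs four times and "movie" and "plot" twice each, so they survive the cut, while the four terms that occur only once are dropped.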


In [15]:
# !pip install datasets

In [16]:
from datasets import load_dataset

# Load IMDb dataset
dataset = load_dataset("imdb")
documents = dataset["train"]["text"]
labels = dataset["train"]["label"]


In [17]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split IMDb dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english', max_features=5000)),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

### Predict
With the IMDb dataset, we’re predicting **sentiment** of movie reviews. Each review is labeled as either:

- **0**: Negative sentiment
- **1**: Positive sentiment

### Objective
The goal is to classify each movie review as either positive or negative based on its content.

### Example
For instance:
- **Input Review**: “This movie was absolutely fantastic! The acting was top-notch, and the storyline kept me engaged till the end.”
  - **Predicted Sentiment**: **1** (Positive)

- **Input Review**: “The plot was boring and predictable. I wouldn't recommend this movie.”
  - **Predicted Sentiment**: **0** (Negative)

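This is a binary classification problem, where the model learns patterns in the text to distinguish between positive and negative reviews. TF-IDF gives more weight to words that are distinctive for a review and less to words that appear in almost every review, which usually gives the classifier a cleaner signal than raw counts. Note that with the default `ngram_range=(1, 1)` the features are single words; multi-word phrases are only captured if `ngram_range` is widened.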
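
The sketch below shows what calling the pipeline on new, raw review text looks like. To stay self-contained it trains on a tiny invented training set (`toy_reviews` and `toy_labels` are assumptions, not IMDb data), so the probabilities are not meaningful; only the calling pattern carries over to the full model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy training set; 1 = positive, 0 = negative (same encoding as IMDb)
toy_reviews = [
    "fantastic acting great story",
    "great fun fantastic cast",
    "boring acting terrible story",
    "terrible dull boring cast",
]
toy_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("classifier", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_reviews, toy_labels)

# The pipeline accepts raw strings; vectorization happens inside predict()
new_reviews = [
    "This movie was absolutely fantastic! The acting was top-notch, and the storyline kept me engaged till the end.",
    "The plot was boring and predictable. I wouldn't recommend this movie.",
]
predicted = toy_pipeline.predict(new_reviews)
probabilities = toy_pipeline.predict_proba(new_reviews)

for review, label, proba in zip(new_reviews, predicted, probabilities):
    print(label, proba.round(2), review[:40])
```

Words that never appeared during training are simply ignored by the vectorizer, which is why a realistic training set matters far more than it does in this toy setup.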

In [18]:
from sklearn.metrics import classification_report, accuracy_score

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Print the classification report and accuracy
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.87      0.88      2515
           1       0.87      0.89      0.88      2485

    accuracy                           0.88      5000
   macro avg       0.88      0.88      0.88      5000
weighted avg       0.88      0.88      0.88      5000

Accuracy: 0.8814


## Spam | Ham

This section summarizes the **SMS Spam Collection** dataset and the prediction task.

### Dataset Overview
The SMS Spam Collection dataset is a text dataset containing **SMS messages** labeled as either **spam** or **ham**:
- **Spam**: Messages that are unsolicited or promotional, like ads or scams.
- **Ham**: Regular (non-spam) messages that are genuine conversations.

### Structure
- **Text Messages (documents)**: Each SMS message is a string of text containing the content of the message.
- **Labels (target)**: Each message is labeled as:
  - **"spam"** if it’s a spam message.
  - **"ham"** if it’s a regular (non-spam) message.

### Objective
The goal is to build a model that can classify SMS messages as **spam or ham** based on their content. This is a **binary classification task**.

### Example Messages
- **Spam Message**: “You have won a $1000 Walmart gift card. Go to http://xyz.com to claim now.”
  - **Predicted Label**: **"spam"**

- **Ham Message**: “Hey! Are we still meeting at 5 PM for coffee?”
  - **Predicted Label**: **"ham"**

### Prediction Task
Our model will learn to distinguish between spam and ham messages based on the vocabulary and patterns in the text. For instance, words like “win,” “free,” “claim,” or links (http://) may indicate spam, while casual, conversational language is likely to indicate ham.

### Applications
This task mirrors real-world applications, such as spam filters in email services, which automatically sort spam messages into a separate folder, improving user experience by filtering out unwanted content.


In [20]:
from datasets import load_dataset

# Load SMS Spam dataset
dataset = load_dataset("sms_spam")
documents = dataset["train"]["sms"]
labels = dataset["train"]["label"]


### Predict
The `class_weight="balanced"` parameter in `LogisticRegression` is specifically designed to address **class imbalance** in datasets where one class significantly outnumbers the other. Here’s how it works and why it often improves performance in imbalanced data situations, like with spam detection:

### 1. **Class Imbalance in Datasets**
   - In many datasets, one class is much more common than the other. In the SMS Spam dataset, "ham" messages heavily outnumber "spam" messages: the test split evaluated later in this section contains 954 ham messages (label 0) and 161 spam messages (label 1).
   - With such an imbalance, a model can reach high accuracy simply by favoring the majority class, at the cost of poor performance on the minority class.

### 2. **How `class_weight="balanced"` Works**
   - When `class_weight="balanced"`, the Logistic Regression model automatically adjusts the weights of each class inversely proportional to their frequencies.
   - This means it assigns a **higher weight to the minority class** (e.g., "spam") and a **lower weight to the majority class** (e.g., "ham").
   - The model is then "penalized" more for misclassifying the minority class, prompting it to learn features more relevant to that class and helping it recognize both classes more fairly.
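
The inverse-frequency weighting can be reproduced by hand. scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * np.bincount(y))`; the sketch below applies that formula in plain Python to hypothetical label counts (900 ham, 100 spam, chosen only to keep the arithmetic easy to follow).

```python
from collections import Counter

# Hypothetical labels for illustration: 0 = ham, 1 = spam
toy_labels = [0] * 900 + [1] * 100

counts = Counter(toy_labels)
n_samples = len(toy_labels)
n_classes = len(counts)

# weight_c = n_samples / (n_classes * count_c)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in sorted(counts.items())
}
print(balanced_weights)
```

With these counts the spam class gets a weight of 5.0 and the ham class about 0.56, so each spam example counts nine times as much as a ham example in the loss.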

### 3. **Impact on Precision, Recall, and F1-Score**
   - **Precision**: Balanced weights make the model more willing to predict the minority class ("spam"), so it flags more messages. Some of the extra flags are false positives, so precision for the minority class usually stays flat or drops slightly rather than improving.
   - **Recall**: By balancing the class weights, the model pays more attention to correctly identifying instances of the minority class, which typically increases recall (the percentage of true spam messages correctly identified).
   - **F1-Score**: Since F1-score is the harmonic mean of precision and recall, it improves whenever the gain in recall outweighs any loss in precision, indicating a better balance between catching spam and avoiding false alarms.
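
The three metrics follow directly from the counts of true positives, false positives, and false negatives for the spam class. The sketch below uses two sets of invented counts (not results from this notebook) to show how the metrics move when a model starts flagging more messages as spam.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for 100 true spam messages, for illustration only
# Conservative model: flags few messages, misses 30 spam
conservative = precision_recall_f1(true_positives=70, false_positives=2, false_negatives=30)
# More sensitive model: flags more messages, misses only 8 spam
sensitive = precision_recall_f1(true_positives=92, false_positives=10, false_negatives=8)

for name, (p, r, f1) in [("conservative", conservative), ("sensitive", sensitive)]:
    print(f"{name:>12}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this made-up case the more sensitive model gives up some precision but gains much more recall, so its F1-score is higher.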

### Why the Improvement Happens
Without `class_weight="balanced"`, the model can end up largely ignoring the minority class to achieve higher overall accuracy, typically resulting in poor recall for that class. By balancing class weights, the model is forced to learn patterns associated with both classes more equitably, resulting in a model that’s more sensitive to the minority class and generally more robust in imbalanced scenarios.

This adjustment is especially effective in tasks like spam detection, fraud detection, and disease diagnosis, where one class (like spam or positive diagnoses) is often rare but critical to identify correctly.

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from datasets import load_dataset

# Load SMS Spam dataset
dataset = load_dataset("sms_spam")
documents = dataset["train"]["sms"]
labels = dataset["train"]["label"]

# Optional: Check if predefined test split exists
# X_train, X_test, y_train, y_test = dataset["train"]["sms"], dataset["test"]["sms"], dataset["train"]["label"], dataset["test"]["label"]

# Manual train-test split if not predefined
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english', max_features=5000)),
    ('classifier', LogisticRegression(max_iter=1000, class_weight="balanced"))  # class_weight="balanced" for handling imbalance
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Print the classification report and accuracy
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.98      0.99       954
           1       0.90      0.93      0.91       161

    accuracy                           0.97      1115
   macro avg       0.95      0.95      0.95      1115
weighted avg       0.98      0.97      0.98      1115

Accuracy: 0.9748878923766816


In [None]:
# from sklearn.datasets import fetch_openml

# # Load SMS spam dataset
# sms = fetch_openml("sms_spam", version=1, as_frame=True)
# documents = sms.data["text"]
# labels = sms.target