
# Task
Build a semi-supervised document classifier that uses self-training to label unlabeled documents and improve performance. The workflow should:

1. Simulate a small labeled dataset (positive/negative text samples) and a larger unlabeled dataset.
2. Vectorize text using TF-IDF fitted on both labeled and unlabeled data (transductive approach).
3. Train a Logistic Regression classifier on the small labeled dataset.
4. Predict and pseudo-label unlabeled data, selecting only high-confidence predictions (e.g., ≥90%).
5. Retrain the model with both the original labeled data and the high-confidence pseudo-labeled data.
6. Evaluate by printing pseudo-labels for the unlabeled dataset.

## Simulate data

### Subtask:
Create a small labeled dataset and a larger unlabeled dataset.


**Reasoning**:
Create the labeled and unlabeled datasets as lists of strings and then convert them into pandas DataFrames as instructed.



In [1]:
import pandas as pd

# 1. Create labeled data
labeled_positive = [
    "This is a great product! I love it.",
    "Highly recommend this service.",
    "Excellent experience with this company.",
    "Very satisfied with the results."
]

labeled_negative = [
    "This is a terrible product. I regret buying it.",
    "Would not recommend this at all.",
    "Had a very bad experience.",
    "Completely unsatisfied."
]

# 2. Create unlabeled data (larger)
unlabeled_data = [
    "I had a mixed experience with this.",
    "The service was okay, but not great.",
    "The product arrived quickly.",
    "Need more time to evaluate.",
    "The price was reasonable.",
    "Looking for alternatives.",
    "Heard good things about this.",
    "Some features are missing.",
    "It works, but not as expected.",
    "Customer support was helpful.",
    "Considering returning this.",
    "Not sure how I feel about it.",
    "It's an average product.",
    "Might try something else next time.",
    "The packaging was damaged.",
    "Easy to use interface.",
    "Difficult to set up.",
    "Performance is inconsistent.",
    "Good value for money.",
    "Wish it had more features."
]

# 3. Create labeled DataFrame
labeled_texts = labeled_positive + labeled_negative
labeled_labels = [1] * len(labeled_positive) + [0] * len(labeled_negative)
labeled_df = pd.DataFrame({'text': labeled_texts, 'label': labeled_labels})

# 4. Create unlabeled DataFrame
unlabeled_df = pd.DataFrame({'text': unlabeled_data})

# Display the first few rows of each DataFrame to verify
display(labeled_df.head())
display(unlabeled_df.head())

| | text | label |
|---|---|---|
| 0 | This is a great product! I love it. | 1 |
| 1 | Highly recommend this service. | 1 |
| 2 | Excellent experience with this company. | 1 |
| 3 | Very satisfied with the results. | 1 |
| 4 | This is a terrible product. I regret buying it. | 0 |


| | text |
|---|---|
| 0 | I had a mixed experience with this. |
| 1 | The service was okay, but not great. |
| 2 | The product arrived quickly. |
| 3 | Need more time to evaluate. |
| 4 | The price was reasonable. |


## Vectorize data

### Subtask:
Use TF-IDF to vectorize both labeled and unlabeled data.


**Reasoning**:
The task requires vectorizing the text data from both labeled and unlabeled dataframes using TF-IDF. This involves importing the necessary vectorizer, fitting it on the combined text data, and then transforming the text data from each dataframe separately.



In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine text data from both dataframes
combined_text = pd.concat([labeled_df['text'], unlabeled_df['text']], axis=0)

# Instantiate TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer on the combined text data
tfidf_vectorizer.fit(combined_text)

# Transform the text data in labeled_df and unlabeled_df separately
X_labeled = tfidf_vectorizer.transform(labeled_df['text'])
X_unlabeled = tfidf_vectorizer.transform(unlabeled_df['text'])

# Print the shapes of the transformed data to verify
print("Shape of labeled data after TF-IDF:", X_labeled.shape)
print("Shape of unlabeled data after TF-IDF:", X_unlabeled.shape)

Shape of labeled data after TF-IDF: (8, 83)
Shape of unlabeled data after TF-IDF: (20, 83)
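
The transductive choice above (fitting the vectorizer on labeled and unlabeled text together) can be illustrated with a minimal sketch. The two tiny corpora and the names `inductive_vectorizer` and `transductive_vectorizer` are assumptions made for this example, not part of the notebook. The sketch only shows that the combined fit yields a vocabulary that includes terms seen exclusively in unlabeled text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpora standing in for labeled_df['text'] and unlabeled_df['text'].
labeled_texts = ["This is a great product! I love it.", "Had a very bad experience."]
unlabeled_texts = ["The packaging was damaged.", "Good value for money."]

# Inductive variant: the vocabulary comes from the labeled text only.
inductive_vectorizer = TfidfVectorizer().fit(labeled_texts)

# Transductive variant (the notebook's approach): the vocabulary also covers unlabeled text.
transductive_vectorizer = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)

inductive_vocab = set(inductive_vectorizer.vocabulary_)
transductive_vocab = set(transductive_vectorizer.vocabulary_)

# Terms that only the transductive vectorizer can represent.
extra_terms = sorted(transductive_vocab - inductive_vocab)
print("Inductive vocabulary size:", len(inductive_vocab))
print("Transductive vocabulary size:", len(transductive_vocab))
print("Terms seen only in unlabeled text:", extra_terms)
```

With a labeled-only vocabulary, words that appear only in unlabeled documents would be dropped at transform time, so those documents would carry less signal into the classifier.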


## Initial model training

### Subtask:
Train a Logistic Regression classifier on the initial labeled dataset.


**Reasoning**:
Fit scikit-learn's `LogisticRegression` with its default hyperparameters on the TF-IDF features (`X_labeled`) and the labels of the eight labeled samples.



In [3]:
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression model
logistic_regression_model = LogisticRegression()

# Train the model using the X_labeled feature matrix and the labeled_df['label']
logistic_regression_model.fit(X_labeled, labeled_df['label'])

## Pseudo-labeling

### Subtask:
Predict labels for the unlabeled data using the trained model and select high-confidence predictions.


**Reasoning**:
Predict probabilities and labels for unlabeled data, select high-confidence predictions, and create a new DataFrame with these pseudo-labeled data points as instructed.



In [4]:
import numpy as np

# Predict class probabilities for the unlabeled data
unlabeled_probabilities = logistic_regression_model.predict_proba(X_unlabeled)

# Get the maximum probability for each prediction
max_probabilities = np.max(unlabeled_probabilities, axis=1)

# Predict the class labels for the unlabeled data
predicted_labels = logistic_regression_model.predict(X_unlabeled)

# Create a boolean mask for predictions with confidence >= 0.9
confidence_mask = max_probabilities >= 0.9

# Select the pseudo-labeled data points based on the confidence mask
pseudo_labeled_df = unlabeled_df[confidence_mask].copy()
pseudo_labeled_df['label'] = predicted_labels[confidence_mask]

# Display the pseudo-labeled DataFrame
display(pseudo_labeled_df)

| | text | label |
|---|---|---|
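
An empty result like this is easier to interpret after looking at how the confidences are distributed. The following is a minimal diagnostic sketch, assuming the same labeled samples as the notebook and a five-document subset of its unlabeled data; the threshold grid and variable names are illustrative choices, not part of the original workflow.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = [
    "This is a great product! I love it.",
    "Highly recommend this service.",
    "Excellent experience with this company.",
    "Very satisfied with the results.",
    "This is a terrible product. I regret buying it.",
    "Would not recommend this at all.",
    "Had a very bad experience.",
    "Completely unsatisfied.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Subset of the notebook's unlabeled data, kept short for the example.
unlabeled_texts = [
    "I had a mixed experience with this.",
    "The product arrived quickly.",
    "Customer support was helpful.",
    "Difficult to set up.",
    "Good value for money.",
]

# Same setup as the notebook: transductive TF-IDF, default LogisticRegression.
vectorizer = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
model = LogisticRegression().fit(vectorizer.transform(labeled_texts), labels)

# Confidence = probability of the predicted (most likely) class.
confidences = model.predict_proba(vectorizer.transform(unlabeled_texts)).max(axis=1)

# Count how many documents would be pseudo-labeled at each candidate threshold.
for threshold in (0.5, 0.6, 0.7, 0.8, 0.9):
    n_selected = int((confidences >= threshold).sum())
    print(f"threshold={threshold:.1f}: {n_selected} of {len(confidences)} selected")

print("Highest confidence:", float(confidences.max()))
```

Sweeping the threshold this way shows where the model's confidences actually fall, which helps pick a cutoff that selects some samples without accepting every prediction.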


## Combine data

### Subtask:
Combine the original labeled data with the high-confidence pseudo-labeled data.


**Reasoning**:
Concatenate the original labeled data and the pseudo-labeled dataframes, store in a new dataframe, and display the head to verify.



In [5]:
# Concatenate the original labeled_df and the pseudo_labeled_df DataFrames.
combined_labeled_df = pd.concat([labeled_df, pseudo_labeled_df], ignore_index=True)

# Display the first few rows of the combined_labeled_df to verify the combination.
display(combined_labeled_df.head())

| | text | label |
|---|---|---|
| 0 | This is a great product! I love it. | 1 |
| 1 | Highly recommend this service. | 1 |
| 2 | Excellent experience with this company. | 1 |
| 3 | Very satisfied with the results. | 1 |
| 4 | This is a terrible product. I regret buying it. | 0 |


## Retrain model

### Subtask:
Retrain the Logistic Regression classifier on the combined dataset.


**Reasoning**:
Vectorize the combined labeled data and retrain the logistic regression model.



In [6]:
# Vectorize the text data in the combined_labeled_df DataFrame
X_combined = tfidf_vectorizer.transform(combined_labeled_df['text'])

# Train the logistic_regression_model using the vectorized combined data and labels
logistic_regression_model.fit(X_combined, combined_labeled_df['label'])

## Evaluate and display pseudo-labels

### Subtask:
Predict labels for the unlabeled data using the retrained model and display the pseudo-labels.


**Reasoning**:
Predict the labels for the unlabeled data using the retrained model, add these predictions as a 'pseudo_label' column to the `unlabeled_df`, and display the updated DataFrame.



In [7]:
# Predict labels for the unlabeled data using the retrained model
predicted_pseudo_labels = logistic_regression_model.predict(X_unlabeled)

# Add a new column named 'pseudo_label' to the original unlabeled_df DataFrame
unlabeled_df['pseudo_label'] = predicted_pseudo_labels

# Display the unlabeled_df DataFrame including the newly added 'pseudo_label' column
display(unlabeled_df)

| | text | pseudo_label |
|---|---|---|
| 0 | I had a mixed experience with this. | 1 |
| 1 | The service was okay, but not great. | 1 |
| 2 | The product arrived quickly. | 1 |
| 3 | Need more time to evaluate. | 0 |
| 4 | The price was reasonable. | 1 |
| 5 | Looking for alternatives. | 0 |
| 6 | Heard good things about this. | 1 |
| 7 | Some features are missing. | 0 |
| 8 | It works, but not as expected. | 0 |
| 9 | Customer support was helpful. | 0 |


## Summary:

### Data Analysis Key Findings

*   Initially, the labeled dataset contained 8 text samples (4 positive and 4 negative), while the unlabeled dataset contained 20 text samples.
*   TF-IDF vectorization was applied to both labeled and unlabeled data, resulting in a feature space of 83 features.
*   No prediction from the initial Logistic Regression model reached the 0.9 confidence threshold on the unlabeled data, so the pseudo-labeled dataset was empty.
*   The combined dataset therefore contained only the original 8 labeled samples. Retraining on it used exactly the same training data as the initial fit, so the self-training step had no effect in this run.
*   The retrained model predicted pseudo-labels for all 20 unlabeled text samples, assigning both 0 (negative) and 1 (positive). Because none of these predictions reached 0.9 confidence in the pseudo-labeling step, they should be treated as low-confidence labels.
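
The point about retraining on an unchanged dataset can be checked directly. This is a small sketch, assuming the notebook's eight labeled samples and a default `LogisticRegression`; the names `coef_before` and `coef_after` are introduced here for illustration. It refits the model on the same data and compares the learned coefficients.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = [
    "This is a great product! I love it.",
    "Highly recommend this service.",
    "Excellent experience with this company.",
    "Very satisfied with the results.",
    "This is a terrible product. I regret buying it.",
    "Would not recommend this at all.",
    "Had a very bad experience.",
    "Completely unsatisfied.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(labeled_texts)

model = LogisticRegression()
model.fit(X, labels)
coef_before = model.coef_.copy()

# "Retrain" with zero pseudo-labels added: the training data is unchanged.
model.fit(X, labels)
coef_after = model.coef_.copy()

# True when the two fits learned the same weights.
same_model = bool(np.allclose(coef_before, coef_after))
print("Coefficients unchanged after retraining:", same_model)
```

A check like this is a cheap guard in a self-training loop: if no pseudo-labels were added, the retraining step can be skipped.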

### Insights or Next Steps

*   The 0.9 confidence threshold was too strict for a model trained on only 8 labeled samples, so no pseudo-labeled data was added. Consider lowering the threshold or using a different semi-supervised technique better suited to very limited labeled data.
*   Evaluate the initial and retrained models on a separate labeled test set (if available) and compare them. Without one, the actual impact of the pseudo-labeling process cannot be measured, and the printed pseudo-labels remain unverified.
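
One way to try a lower threshold is scikit-learn's `SelfTrainingClassifier`, which repeats the predict, select, and retrain cycle until no new samples qualify. The sketch below is an illustration under assumptions, not the notebook's method: `threshold=0.6` is an arbitrary example value, the unlabeled list is a subset of the notebook's data, and whether any samples are pseudo-labeled depends on the data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

labeled_texts = [
    "This is a great product! I love it.",
    "Highly recommend this service.",
    "Excellent experience with this company.",
    "Very satisfied with the results.",
    "This is a terrible product. I regret buying it.",
    "Would not recommend this at all.",
    "Had a very bad experience.",
    "Completely unsatisfied.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Subset of the notebook's unlabeled data, kept short for the example.
unlabeled_texts = [
    "I had a mixed experience with this.",
    "The product arrived quickly.",
    "Customer support was helpful.",
    "Difficult to set up.",
    "Good value for money.",
]

all_texts = labeled_texts + unlabeled_texts
# scikit-learn's convention: unlabeled samples carry the label -1.
y_all = np.array(labels + [-1] * len(unlabeled_texts))

# Transductive TF-IDF over all documents, as in the notebook.
X_all = TfidfVectorizer().fit_transform(all_texts)

# threshold=0.6 is an illustrative assumption; tune it for real data.
self_training_model = SelfTrainingClassifier(
    LogisticRegression(), threshold=0.6, max_iter=10
)
self_training_model.fit(X_all, y_all)

# labeled_iter_ is 0 for originally labeled samples, >0 for pseudo-labeled ones,
# and -1 for samples that never passed the threshold.
n_added = int((self_training_model.labeled_iter_ > 0).sum())
print("Pseudo-labeled samples added:", n_added)
print("Stopped because:", self_training_model.termination_condition_)

predictions = self_training_model.predict(X_all[len(labeled_texts):])
print("Predictions for unlabeled documents:", predictions)
```

Lower thresholds admit more pseudo-labels but also more mislabeled ones, so the value is best chosen against a labeled validation set when one exists.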
