# **Problem Statement**  
## **22. Build a text classifier using TF-IDF and logistic regression.**

Build a text classification model using TF-IDF vectorization and Logistic Regression to classify text documents into predefined categories.

### Constraints & Example Inputs/Outputs

### Constraints
- Input: List of text documents
- Output: Predicted class labels
- Use:
    - TF-IDF for text vectorization
    - Logistic Regression for classification
- No deep learning models

### Example Input:
```text
Text: "I love this product"
Label: Positive

Text: "This is the worst experience"
Label: Negative
```

### Expected Output:
```text
Input: "Amazing quality and fast delivery"
Prediction: Positive
```

### Solution Approach

**Step 1: Text Vectorization (TF-IDF)**
- Converts each document into a numerical vector
- Down-weights words that appear in most documents (such as "the" and "is")
- Gives more weight to words that are distinctive for a document
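
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes whitespace tokenization and a three-sentence toy corpus that is not part of this notebook's dataset. The smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, is the one scikit-learn documents for its default settings; `tfidf_vector` and `doc_freq` are illustrative names.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not the notebook's dataset).
docs = [
    "the service is great",
    "the service is slow",
    "the food is great",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in doc_freq.items()}

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    raw = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

vec = tfidf_vector(tokenized[1])  # "the service is slow"
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In this toy corpus "slow" occurs in one document, "service" in two, and "the" in all three, so "slow" receives the largest weight and "the" the smallest.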

**Step 2: Model Selection (Logistic Regression)**
- A linear model that works well on high-dimensional, sparse TF-IDF features
- Fast to train and interpretable (one weight per word)
- Outputs class probabilities
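
The scoring step can be sketched in a few lines. The weights below are invented for illustration (a trained model would learn them), and `positive_probability` is a hypothetical helper, not part of scikit-learn. The sketch shows how a weighted sum of TF-IDF features is passed through the sigmoid to give a probability.

```python
import math

# Hypothetical per-word weights; a real model learns these during training.
weights = {"love": 1.2, "amazing": 0.9, "worst": -1.4, "terrible": -1.1}
bias = 0.0

def positive_probability(features):
    """Sigmoid of the weighted sum of {term: tfidf_value} features."""
    z = bias + sum(weights.get(term, 0.0) * value
                   for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

p_pos = positive_probability({"love": 0.7, "amazing": 0.7})
p_neg = positive_probability({"worst": 1.0})
p_unknown = positive_probability({"yesterday": 1.0})  # no learned weight

print(round(p_pos, 3), round(p_neg, 3), p_unknown)
```

A word without a learned weight contributes nothing, so with a zero bias the last probability is exactly 0.5. The sign and size of each weight are what make the model interpretable.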

**Step 3: Training & Prediction**
- Fit TF-IDF on training data
- Train logistic regression
- Predict on new text

### Solution Code

In [1]:
# Step 1: Import Libraries
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


In [2]:
# Step 2: Sample Dataset
texts = [
    "I love this product",
    "This is an amazing experience",
    "Absolutely fantastic service",
    "I hate this item",
    "This is the worst purchase",
    "Terrible experience overall"
]

labels = ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"]

df = pd.DataFrame({"text": texts, "label": labels})
df


|   | text | label |
|---|------|-------|
| 0 | I love this product | Positive |
| 1 | This is an amazing experience | Positive |
| 2 | Absolutely fantastic service | Positive |
| 3 | I hate this item | Negative |
| 4 | This is the worst purchase | Negative |
| 5 | Terrible experience overall | Negative |


In [5]:
# Approach 1: Brute Force Approach (Manual Steps)
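# TF-IDF Vectorization
# Note: fitting on the full corpus before splitting lets test-set vocabulary
# and IDF statistics leak into training. Fitting Approach 2's pipeline on
# the training split only avoids this.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["text"])

# Encode labels manually
label_map = {"Positive": 1, "Negative": 0}
y = df["label"].map(label_map)

# Train-test split
# Note: with six documents, test_size=0.3 holds out only two, and without
# stratify=y both can come from the same class, so the score is not meaningful.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)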

# Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)


In [6]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.0
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       2.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0
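
The report above has support only for class 1, meaning both held-out documents were Positive and the model was trained on a single Positive example. One possible remedy, sketched here as a suggestion rather than as the notebook's method, is to split the raw text with `stratify` and fit the whole pipeline on the training portion only. The names `eval_pipeline`, `train_docs`, and `test_docs` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

texts = [
    "I love this product",
    "This is an amazing experience",
    "Absolutely fantastic service",
    "I hate this item",
    "This is the worst purchase",
    "Terrible experience overall",
]
labels = ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"]

# stratify keeps the class proportions the same in both splits.
train_docs, test_docs, train_y, test_y = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# The vectorizer sits inside the pipeline, so it only ever sees training text.
eval_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("logreg", LogisticRegression()),
])
eval_pipeline.fit(train_docs, train_y)

print("Test labels:", test_y)
print("Accuracy:", eval_pipeline.score(test_docs, test_y))
```

With two test documents the accuracy is still very noisy. The point is only that both classes are now represented in each split.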





### Alternative Solution

In [7]:
# Approach 2: Optimized Approach (Pipeline – Best Practice)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("logreg", LogisticRegression())
])

pipeline.fit(df["text"], df["label"])


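**Pipeline**

| Parameter | Value |
|---|---|
| steps | `[('tfidf', ...), ('logreg', ...)]` |
| transform_input | `None` |
| memory | `None` |
| verbose | `False` |

**TfidfVectorizer**

| Parameter | Value |
|---|---|
| input | `'content'` |
| encoding | `'utf-8'` |
| decode_error | `'strict'` |
| strip_accents | `None` |
| lowercase | `True` |
| preprocessor | `None` |
| tokenizer | `None` |
| analyzer | `'word'` |
| stop_words | `'english'` |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |

**LogisticRegression**

| Parameter | Value |
|---|---|
| penalty | `'l2'` |
| dual | `False` |
| tol | `0.0001` |
| C | `1.0` |
| fit_intercept | `True` |
| intercept_scaling | `1` |
| class_weight | `None` |
| random_state | `None` |
| solver | `'lbfgs'` |
| max_iter | `100` |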


In [8]:
# Prediction Function
def predict_sentiment(text):
    return pipeline.predict([text])[0]
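
Because logistic regression outputs probabilities, a prediction helper can also report how confident the model is. The sketch below rebuilds the same pipeline so that it runs on its own. `predict_with_confidence` and `clf` are hypothetical names, while `predict_proba` and `classes_` are standard scikit-learn attributes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_texts = [
    "I love this product",
    "This is an amazing experience",
    "Absolutely fantastic service",
    "I hate this item",
    "This is the worst purchase",
    "Terrible experience overall",
]
train_labels = ["Positive", "Positive", "Positive",
                "Negative", "Negative", "Negative"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("logreg", LogisticRegression()),
])
clf.fit(train_texts, train_labels)

def predict_with_confidence(text):
    """Hypothetical helper: return (label, probability of that label)."""
    probabilities = clf.predict_proba([text])[0]
    best = probabilities.argmax()
    return clf.classes_[best], float(probabilities[best])

label, confidence = predict_with_confidence("I really love this amazing product")
print(label, round(confidence, 3))
```

On a dataset this small the probabilities tend to stay close to 0.5, which is a useful signal that the model has seen too little data to be trusted.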


### Alternative Approaches

- CountVectorizer + Naive Bayes
- Word2Vec / GloVe embeddings
- Linear SVM
- Deep Learning (LSTM, Transformers)

➡️ TF-IDF + Logistic Regression is a fast, strong baseline
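
The first and third alternatives are drop-in swaps inside the same pipeline structure. The sketch below is illustrative rather than a benchmark: it fits both on the six-sentence dataset and prints one prediction each. The step names `vec` and `clf` and the `candidates` dictionary are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "I love this product",
    "This is an amazing experience",
    "Absolutely fantastic service",
    "I hate this item",
    "This is the worst purchase",
    "Terrible experience overall",
]
train_labels = ["Positive", "Positive", "Positive",
                "Negative", "Negative", "Negative"]

candidates = {
    "count_nb": Pipeline([
        ("vec", CountVectorizer(stop_words="english")),
        ("clf", MultinomialNB()),
    ]),
    "tfidf_svm": Pipeline([
        ("vec", TfidfVectorizer(stop_words="english")),
        ("clf", LinearSVC()),
    ]),
}

results = {}
for name, model in candidates.items():
    model.fit(train_texts, train_labels)
    results[name] = model.predict(["Fantastic quality"])[0]
    print(name, "->", results[name])
```

Note that `LinearSVC` does not provide `predict_proba`, which is one reason to prefer logistic regression when probabilities are needed.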

### Test Case

In [9]:
# Test Case 1: Positive Sentence
text = "I really love this amazing product"
prediction = predict_sentiment(text)
print("Prediction:", prediction)

assert prediction == "Positive"
print("Test Case 1 Passed")


Prediction: Positive
Test Case 1 Passed


In [10]:
# Test Case 2: Negative Sentence
text = "This is a horrible and terrible experience"
prediction = predict_sentiment(text)
print("Prediction:", prediction)

assert prediction == "Negative"
print("Test Case 2 Passed")


Prediction: Negative
Test Case 2 Passed


In [11]:
# Test Case 3: Neutral-Looking Sentence
text = "The product arrived yesterday"
prediction = predict_sentiment(text)
print("Prediction:", prediction)


Prediction: Positive


In [12]:
# Test Case 4: Batch Prediction
texts = [
    "Fantastic quality",
    "Worst service ever",
    "I am very happy",
    "Completely disappointed"
]

pipeline.predict(texts)


array(['Positive', 'Negative', 'Positive', 'Positive'], dtype=object)
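
The last two sentences in the batch contain no word that survives into the training vocabulary, which is worth checking before trusting their labels. The sketch below refits the same vectorizer on the six training sentences and counts how many known terms each query contains. `known_terms` is an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    "I love this product",
    "This is an amazing experience",
    "Absolutely fantastic service",
    "I hate this item",
    "This is the worst purchase",
    "Terrible experience overall",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(train_texts)

queries = [
    "The product arrived yesterday",
    "I am very happy",
    "Completely disappointed",
]

known_terms = {}
for query in queries:
    row = vectorizer.transform([query])
    known_terms[query] = row.nnz  # number of non-zero features
    print(f"{query!r}: {row.nnz} known term(s)")
```

A query with zero known terms becomes an all-zero vector, so the prediction depends only on the model's intercept and says nothing about the sentence. "The product arrived yesterday" matches only "product", which appears in a Positive training sentence, and that is consistent with the Positive prediction in Test Case 3.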

## Complexity Analysis

### TF-IDF Vectorization
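- Time: O(n × d) in the worst case; in practice proportional to the total number of tokens
- Space: O(n × d) if stored densely; scikit-learn returns a sparse matrix, so actual usage is proportional to the number of non-zero entries

### Logistic Regression
- Training: O(n × d) per optimizer iteration
- Prediction: O(d) per document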

Where:

n = number of documents

d = vocabulary size
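
As a rough check on the space bound, the sketch below compares the n × d cell count of a dense matrix with the number of values the sparse TF-IDF matrix actually stores. It uses the six training sentences with default vectorizer settings, which is an assumption made for illustration. `corpus` and `dense_cells` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love this product",
    "This is an amazing experience",
    "Absolutely fantastic service",
    "I hate this item",
    "This is the worst purchase",
    "Terrible experience overall",
]

X = TfidfVectorizer().fit_transform(corpus)  # SciPy sparse matrix
n, d = X.shape
dense_cells = n * d

print(f"n={n}, d={d}")
print(f"dense cells: {dense_cells}, stored non-zeros: {X.nnz}")
print(f"density: {X.nnz / dense_cells:.2%}")
```

Each document uses only a handful of the vocabulary's words, so the stored non-zeros grow with the total number of distinct tokens per document rather than with n × d.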
