# Text Classifier Based on Amazon Product Reviews

### The dataset used here is taken from https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews

## Imports and Logging Configuration

In [2]:
import pandas as pd
import numpy as np
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logging.info('Libraries imported successfully')

2025-03-19 15:03:17,519 - INFO - Libraries imported successfully


## Importing Training Dataset

In [3]:
logging.info('Loading training dataset from train.csv (no header assumed)')
df_train = pd.read_csv('Data_sets/train.csv', header=None)
logging.info(f'Training dataset shape: {df_train.shape}')
print(df_train.head())

2025-03-19 15:03:17,555 - INFO - Loading training dataset from train.csv (no header assumed)
2025-03-19 15:03:33,962 - INFO - Training dataset shape: (3600000, 3)


   0                                                  1  \
0  2                     Stuning even for the non-gamer   
1  2              The best soundtrack ever to anything.   
2  2                                           Amazing!   
3  2                               Excellent Soundtrack   
4  2  Remember, Pull Your Jaw Off The Floor After He...   

                                                   2  
0  This sound track was beautiful! It paints the ...  
1  I'm reading a lot of reviews saying that this ...  
2  This soundtrack is my favorite music of all ti...  
3  I truly like this soundtrack and I enjoy video...  
4  If you've played the game, you know how divine...  


## Processing Training Data

In [4]:
logging.info('Preparing training data: assuming first column is label and remaining columns are text features')

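# Labels are stored as 1/2 in the CSV; shift them to 0/1.
y_train = np.array(df_train.iloc[:, 0]) - 1

# Join the remaining text columns (review title and body) into one string per row.
X_train_text = df_train.iloc[:, 1:]
X_train_combined = X_train_text.apply(lambda row: ' '.join(row.astype(str)), axis=1)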

logging.info('Sample combined training text:')
print(X_train_combined.iloc[0])

X_train = X_train_combined
logging.info(f'Total training samples: {len(X_train)}')

2025-03-19 15:03:34,005 - INFO - Preparing training data: assuming first column is label and remaining columns are text features
2025-03-19 15:04:56,653 - INFO - Sample combined training text:
2025-03-19 15:04:56,659 - INFO - Total training samples: 3600000


Stuning even for the non-gamer This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^
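
Judging by the log timestamps, the row-wise `apply` above took roughly 80 seconds on 3.6 million rows. Below is a minimal sketch of a vectorized alternative, assuming the same three-column layout (label, title, body). The helper name `combine_title_and_body` and the toy frame are illustrative and not part of the notebook. It differs in one respect: missing values become empty strings instead of the literal text `nan`.

```python
import numpy as np
import pandas as pd


def combine_title_and_body(df):
    """Concatenate columns 1 and 2 into one string per row (hypothetical helper)."""
    title = df[1].fillna('').astype(str)
    body = df[2].fillna('').astype(str)
    # Element-wise string concatenation avoids a Python-level loop over rows.
    return (title + ' ' + body).str.strip()


# Toy frame mimicking the headerless CSV: label, title, body.
toy = pd.DataFrame({
    0: [2, 1],
    1: ['Great CD', np.nan],
    2: ['I still love it.', 'Stopped working.'],
})

labels = toy[0].to_numpy() - 1          # map {1, 2} to {0, 1}, as in the notebook
texts = combine_title_and_body(toy)
print(labels.tolist())
print(texts.tolist())
```

On the real data this would be called as `combine_title_and_body(df_train)`. Whether dropping the `nan` tokens changes accuracy is not something the notebook measures.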


## Importing Test Dataset

In [5]:
logging.info('Loading test dataset from test.csv (no header assumed)')
df_test = pd.read_csv('Data_sets/test.csv', header=None)
logging.info(f'Test dataset shape: {df_test.shape}')
print(df_test.head())

2025-03-19 15:04:56,751 - INFO - Loading test dataset from test.csv (no header assumed)
2025-03-19 15:04:58,419 - INFO - Test dataset shape: (400000, 3)


   0                                                  1  \
0  2                                           Great CD   
1  2  One of the best game music soundtracks - for a...   
2  1                   Batteries died within a year ...   
3  2              works fine, but Maha Energy is better   
4  2                       Great for the non-audiophile   

                                                   2  
0  My lovely Pat has one of the GREAT voices of h...  
1  Despite the fact that I have only played a sma...  
2  I bought this charger in Jul 2003 and it worke...  
3  Check out Maha Energy's website. Their Powerex...  
4  Reviewed quite a bit of the combo players and ...  


## Test Data Preprocessing

In [6]:
logging.info('Preparing test data: assuming first column is label and remaining columns are text features')

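# Labels are stored as 1/2 in the CSV; shift them to 0/1.
y_test = np.array(df_test.iloc[:, 0]) - 1

# Join the remaining text columns (review title and body) into one string per row.
X_test_text = df_test.iloc[:, 1:]
X_test_combined = X_test_text.apply(lambda row: ' '.join(row.astype(str)), axis=1)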

logging.info('Sample combined test text:')
print(X_test_combined.iloc[0])

X_test = X_test_combined
logging.info(f'Total test samples: {len(X_test)}')

2025-03-19 15:04:58,458 - INFO - Preparing test data: assuming first column is label and remaining columns are text features
2025-03-19 15:05:04,680 - INFO - Sample combined test text:
2025-03-19 15:05:04,681 - INFO - Total test samples: 400000


Great CD My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"


## Vectorizing the Text with TF-IDF

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

logging.info('Initializing TfidfVectorizer with advanced parameters')

vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=5000,
    ngram_range=(1,2),
    min_df=5,
    max_df=0.8
)

logging.info('Fitting vectorizer on training data')
X_train_vect = vectorizer.fit_transform(X_train)
logging.info(f'Vectorized training data shape: {X_train_vect.shape}')

logging.info('Transforming test data using the fitted vectorizer')
X_test_vect = vectorizer.transform(X_test)
logging.info(f'Vectorized test data shape: {X_test_vect.shape}')

print('Sample feature names:', vectorizer.get_feature_names_out()[:10])

2025-03-19 15:05:08,242 - INFO - Initializing TfidfVectorizer with advanced parameters
2025-03-19 15:05:08,243 - INFO - Fitting vectorizer on training data
2025-03-19 15:12:01,371 - INFO - Vectorized training data shape: (3600000, 5000)
2025-03-19 15:12:01,453 - INFO - Transforming test data using the fitted vectorizer
2025-03-19 15:12:25,575 - INFO - Vectorized test data shape: (400000, 5000)


Sample feature names: ['00' '000' '10' '10 minutes' '10 years' '100' '1000' '11' '12' '13']
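
To make the vectorizer step less opaque, here is a small pure-Python sketch of the weighting that scikit-learn's `TfidfVectorizer` applies with its default `smooth_idf=True` and `norm='l2'` settings: the raw term count times `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It deliberately ignores the tokenizer details, stop words, bigrams, and the `min_df`/`max_df`/`max_features` pruning configured above. The function name `tfidf_rows` and the toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["great sound great price", "terrible sound"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s}{weight:.3f}")
```

Terms that occur in every document (here `sound`) receive the lowest IDF, which is why they end up with less weight than rarer terms with the same count.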


## Applying Logistic Regression

In [8]:
from sklearn.linear_model import LogisticRegression

logging.info('Initializing Logistic Regression model with adjusted regularization')

model = LogisticRegression(max_iter=10000, C=10, random_state=42)

logging.info('Training model on vectorized training data')
model.fit(X_train_vect, y_train)
logging.info('Model training completed')

logging.info('Sample model coefficients:')
print(model.coef_[0][:10])

2025-03-19 15:12:26,673 - INFO - Initializing Logistic Regression model with adjusted regularization
2025-03-19 15:12:26,674 - INFO - Training model on vectorized training data
2025-03-19 15:12:36,476 - INFO - Model training completed
2025-03-19 15:12:36,477 - INFO - Sample model coefficients:


[-0.71476247  0.28510761  0.05113642 -1.02188221  0.11386094  0.43268987
 -0.0690542   0.25810319 -0.22076131  0.14857283]
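
The ten coefficients printed above belong to the first ten features in sorted order (`'00'`, `'000'`, `'10'`, ...), so on their own they say little about sentiment. Here is a sketch of a helper that pairs each feature name with its weight and returns the strongest ones in each direction. `top_features` is a hypothetical name, and the toy inputs stand in for `vectorizer.get_feature_names_out()` and `model.coef_[0]`.

```python
def top_features(feature_names, coefficients, k=3):
    """Return the k most negative and k most positive (name, weight) pairs."""
    ranked = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])
    most_negative = ranked[:k]
    most_positive = ranked[-k:][::-1]   # strongest positive first
    return most_negative, most_positive


# Toy stand-ins; with the fitted objects one would pass
# vectorizer.get_feature_names_out() and model.coef_[0] instead.
names = ['excellent', 'refund', 'waste', 'love', 'ok']
weights = [4.2, -3.1, -5.0, 3.7, 0.1]

negative, positive = top_features(names, weights, k=2)
print('Pushes toward class 0:', negative)
print('Pushes toward class 1:', positive)
```

In a binary logistic regression, a positive weight raises the predicted probability of class 1 and a negative weight raises that of class 0.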


## Evaluation of Classifier

In [9]:
from sklearn.metrics import classification_report, accuracy_score

logging.info('Evaluating model on test data')
y_pred = model.predict(X_test_vect)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

logging.info(f'Accuracy: {accuracy}')
print('Accuracy:', accuracy)
print('Classification Report:')
print(report)

2025-03-19 15:12:36,499 - INFO - Evaluating model on test data
2025-03-19 15:12:36,603 - INFO - Accuracy: 0.8929


Accuracy: 0.8929
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.89      0.89    200000
           1       0.89      0.90      0.89    200000

    accuracy                           0.89    400000
   macro avg       0.89      0.89      0.89    400000
weighted avg       0.89      0.89      0.89    400000
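
For reference, the per-class numbers in the report follow from simple counts of true positives, false positives, and false negatives. Here is a pure-Python sketch for a single class, using made-up labels and not the notebook's predictions; `class_metrics` is an illustrative name.

```python
def class_metrics(y_true, y_pred, target_class):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for illustration only.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 1]

for cls in (0, 1):
    p, r, f = class_metrics(y_true_toy, y_pred_toy, cls)
    print(f'class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
```

The `macro avg` row of the report is the unweighted mean of these per-class values, while `weighted avg` weights each class by its support.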



## Custom Prediction Function

In [10]:
def find_comment_type(comment_text):
    """Predict sentiment for a given comment string using the trained model."""
    if isinstance(comment_text, str):
        comment_text = [comment_text]

    comment_vect = vectorizer.transform(comment_text)

    prediction = model.predict(comment_vect)
    prediction_proba = model.predict_proba(comment_vect)

    logging.info(f'Input comment: {comment_text[0]}')
    logging.info(f'Prediction: {prediction}')
    logging.info(f'Prediction probabilities: {prediction_proba}')
    
    print('Predicted class:', prediction[0])
    print('Prediction probabilities:', prediction_proba)

## Testing

In [12]:
find_comment_type("Hey! worst product ever!")
find_comment_type("hey! this is very amazing! wow!")

2025-03-19 16:17:51,160 - INFO - Input comment: Hey! worst product ever!
2025-03-19 16:17:51,162 - INFO - Prediction: [0]
2025-03-19 16:17:51,163 - INFO - Prediction probabilities: [[9.99774254e-01 2.25746466e-04]]
2025-03-19 16:17:51,166 - INFO - Input comment: hey! this is very amazing! wow!
2025-03-19 16:17:51,167 - INFO - Prediction: [1]
2025-03-19 16:17:51,168 - INFO - Prediction probabilities: [[0.00344593 0.99655407]]


Predicted class: 0
Prediction probabilities: [[9.99774254e-01 2.25746466e-04]]
Predicted class: 1
Prediction probabilities: [[0.00344593 0.99655407]]
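
`find_comment_type` prints its result and reports only the first comment when given a list. If the predictions need to be used programmatically, a variant that returns them may be more convenient. The sketch below is one possible shape. The name `predict_sentiment`, the `LABEL_NAMES` mapping (which assumes class 0 means a negative review and class 1 a positive one), and the tiny stand-in training set are all illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LABEL_NAMES = {0: 'negative', 1: 'positive'}   # assumed meaning of the 0/1 classes


def predict_sentiment(texts, vectorizer, model):
    """Return one (label_name, probability_of_that_label) tuple per input text."""
    if isinstance(texts, str):
        texts = [texts]
    probabilities = model.predict_proba(vectorizer.transform(texts))
    results = []
    for row in probabilities:
        best = int(row.argmax())
        label = int(model.classes_[best])
        results.append((LABEL_NAMES[label], float(row[best])))
    return results


# Tiny stand-in training set so the sketch runs on its own.
toy_texts = ['worst product ever', 'terrible waste of money',
             'amazing product love it', 'great quality very happy']
toy_labels = [0, 0, 1, 1]
toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(C=10, max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

results = predict_sentiment(['terrible waste', 'great quality'],
                            toy_vectorizer, toy_model)
print(results)
```

With the notebook's fitted objects, the call would be `predict_sentiment(comments, vectorizer, model)`.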


## Saving Trained Model

In [14]:
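import os
import joblib

# Make sure the output directory exists before writing to it.
os.makedirs('models', exist_ok=True)

joblib.dump(model, 'models/logistic_model.pkl')

joblib.dump(vectorizer, 'models/tfidf_vectorizer.pkl')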

['tfidf_vectorizer.pkl']
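
The notebook saves the model and the vectorizer as two separate files, and both must be loaded together for inference. One alternative, shown here as a sketch and not as what the notebook does, is to bundle the two steps in a scikit-learn `Pipeline` so that a single file round-trips through `joblib.dump` and `joblib.load`. The toy corpus, the temporary directory, and the file name `sentiment_pipeline.pkl` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; the notebook itself would use X_train and y_train.
texts = ['worst product ever', 'terrible waste of money',
         'amazing product love it', 'great quality very happy']
labels = [0, 0, 1, 1]

# The pipeline applies the vectorizer and then the classifier on raw strings.
pipeline = make_pipeline(TfidfVectorizer(),
                         LogisticRegression(C=10, max_iter=1000))
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sentiment_pipeline.pkl')
    joblib.dump(pipeline, path)       # one file holds both fitted steps
    restored = joblib.load(path)

print(restored.predict(['terrible waste', 'great quality']))
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.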