Multi-class Classification

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor

In [2]:
data = TabularDataset("data/kindle_review/review.csv")
data.head()

         asin  rating                                         reviewText      reviewerID  reviewerName
0  B0033UV8HI       3  Jace Rankin may be short, but he's nothing to ...  A3HHXRELK8BHQG        Ridley
1  B002HJV4DE       5   Great short read. I didn't want to put it dow...  A2RGNZ0TRF578I  Holly Butler
2  B002ZG96I4       3  I'll start by saying this is the first of four...  A3S0H2HV6U1I7F       Merissa
3  B002QHWOEU       3  Aggie is Angela Lansbury who carries pocketboo...   AC4OQW3GZ919J    Cleargrace
4  B001A06VJ8       4  I did not expect this type of book to be in li...  A3C9V987IQHOQD      Rjostler


In [5]:
len(data)

12000

In [3]:
data.isnull().sum()

asin             0
rating           0
reviewText       0
reviewerID       0
reviewerName    38
dtype: int64

In [4]:
data['reviewText'].iloc[0]

'Jace Rankin may be short, but he\'s nothing to mess with, as the man who was just hauled out of the saloon by the undertaker knows now. He\'s a famous bounty hunter in Oregon in the 1890s who, when he shot the man in the saloon, just finished a years long quest to avenge his sister\'s murder and is now trying to figure out what to do next. When the snotty-nosed farm boy he just rescued from a gang of bullies offers him money to kill a man who forced him off his ranch, he reluctantly agrees to bring the man to justice, but not to kill him outright. But, first he needs to tell his sister\'s widower the news.Kyla "Kyle" Springer Bailey has been riding the trails and sleeping on the ground for the past month while trying to find Jace. She wants revenge on the man who killed her husband and took her ranch, amongst other crimes, and she\'s not so keen on the detour Jace wants to take. But she realizes she\'s out of options, so she hides behind her boy persona as best she can and tries to ke

In [7]:
# Split into train (80%) and test (20%) sets by random sampling
train_size = int(len(data)*0.8)
print(f"Training size is: {train_size}")

seed = 42

train_data = data.sample(train_size, random_state=seed)
test_data = data.drop(train_data.index)

Training size is: 9600
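
The split above samples rows uniformly, so the share of each rating in the train and test sets can drift a little from the full data. One possible alternative is to sample 80% within each rating. The sketch below shows the idea on a toy frame with made-up ratings; the frame and the names `toy`, `train_part`, and `test_part` are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the review data (hypothetical values, deliberately imbalanced).
toy = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(100)],
    "rating": [1] * 10 + [2] * 10 + [3] * 20 + [4] * 25 + [5] * 35,
})

# Sample 80% of the rows inside each rating group, then use the rest as test data.
train_part = toy.groupby("rating").sample(frac=0.8, random_state=42)
test_part = toy.drop(train_part.index)

# Each rating keeps the same share in both parts.
print(train_part["rating"].value_counts().sort_index())
print(test_part["rating"].value_counts().sort_index())
```

Applied to the real data, the same two lines would replace `data.sample(...)` and `data.drop(...)`, with `data` in place of `toy`.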


In [8]:
# test_data

In [9]:
save_path = "first_mm_model"

predictor = TabularPredictor(label="rating",
                             path=save_path)

In [10]:
predictor.fit(train_data, 
              hyperparameters="multimodal")

Beginning AutoGluon training ...
AutoGluon will save models to "first_mm_model\"
AutoGluon Version:  0.7.0
Python Version:     3.8.10
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.22000
Train Data Rows:    9600
Train Data Columns: 4
Label Column: rating
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	5 unique label values:  [3, 1, 2, 4, 5]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 5
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    12388.26 MB
	Train Data (Original)  Memory Usage: 8.62 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set f

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x22c23c9cd60>

In [None]:
# If training takes too long, a previously saved predictor can be loaded from its directory instead
# predictor = TabularPredictor.load("book_rating")
# predictor.fit_summary()

In [11]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                 model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2   0.618750       0.680925  548.067513                0.001089           0.268682            2       True          7
1             CatBoost   0.607292       0.220986  346.473003                0.220986         346.473003            1       True          3
2        LightGBMLarge   0.552083       0.479332  175.768492                0.479332         175.768492            1       True          6
3           LightGBMXT   0.542708       0.326854   52.345429                0.326854          52.345429            1       True          2
4              XGBoost   0.538542       0.108001  132.109430                0.108001         132.109430            1       True          4
5             LightGBM   0.534375       0.303980   50.364038                0.303980          50.364038 



{'model_types': {'LightGBM': 'LGBModel',
  'LightGBMXT': 'LGBModel',
  'CatBoost': 'CatBoostModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'LightGBM': 0.534375,
  'LightGBMXT': 0.5427083333333333,
  'CatBoost': 0.6072916666666667,
  'XGBoost': 0.5385416666666667,
  'NeuralNetTorch': 0.35625,
  'LightGBMLarge': 0.5520833333333334,
  'WeightedEnsemble_L2': 0.61875},
 'model_best': 'WeightedEnsemble_L2',
 'model_paths': {'LightGBM': 'first_mm_model\\models\\LightGBM\\',
  'LightGBMXT': 'first_mm_model\\models\\LightGBMXT\\',
  'CatBoost': 'first_mm_model\\models\\CatBoost\\',
  'XGBoost': 'first_mm_model\\models\\XGBoost\\',
  'NeuralNetTorch': 'first_mm_model\\models\\NeuralNetTorch\\',
  'LightGBMLarge': 'first_mm_model\\models\\LightGBMLarge\\',
  'WeightedEnsemble_L2': 'first_mm_model\\models\\WeightedEnsemble_L2\\'},
 'model_fit_times': {'Li

In [12]:
y_test = test_data["rating"]
test_features = test_data.drop(columns=["rating"])

In [13]:
y_preds = predictor.predict(test_features)
y_preds

1        5
4        5
5        5
9        4
11       4
        ..
11980    1
11981    5
11983    4
11986    5
11997    2
Name: rating, Length: 2400, dtype: int64

In [14]:
metrics = predictor.evaluate_predictions(y_true=y_test,
                                         y_pred=y_preds,
                                         auxiliary_metrics=True)

Evaluation: accuracy on test data: 0.5845833333333333
Evaluations on test data:
{
    "accuracy": 0.5845833333333333,
    "balanced_accuracy": 0.5644166663419024,
    "mcc": 0.4739617862657992
}


In [15]:
from sklearn.metrics import confusion_matrix

In [16]:
print(confusion_matrix(y_test, y_preds))

[[260  92  21  17  18]
 [116 176  44  32  14]
 [ 24  60 150 148  42]
 [  8  12  55 332 165]
 [  2   1   6 120 485]]
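
In scikit-learn's convention the rows of this matrix are the true ratings (1 to 5, sorted) and the columns are the predicted ratings. As a cross-check, the three metrics reported by `evaluate_predictions` can be recomputed from the matrix alone. The following is a minimal sketch in plain Python; the variable names are illustrative, and the MCC line uses the standard multiclass generalization of the Matthews correlation coefficient.

```python
import math

# Confusion matrix copied from the output above.
# Rows: true ratings 1-5. Columns: predicted ratings 1-5.
cm = [
    [260,  92,  21,  17,  18],
    [116, 176,  44,  32,  14],
    [ 24,  60, 150, 148,  42],
    [  8,  12,  55, 332, 165],
    [  2,   1,   6, 120, 485],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)                # number of test rows
correct = sum(cm[k][k] for k in range(n_classes))  # diagonal = correct predictions
true_counts = [sum(row) for row in cm]             # row sums
pred_counts = [sum(cm[i][k] for i in range(n_classes)) for k in range(n_classes)]  # column sums

# Accuracy: share of correct predictions.
accuracy = correct / total

# Balanced accuracy: mean of the per-class recalls.
recalls = [cm[k][k] / true_counts[k] for k in range(n_classes)]
balanced_accuracy = sum(recalls) / n_classes

# Multiclass Matthews correlation coefficient.
numerator = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
denominator = math.sqrt(
    (total ** 2 - sum(p * p for p in pred_counts))
    * (total ** 2 - sum(t * t for t in true_counts))
)
mcc = numerator / denominator

print("accuracy:", accuracy)
print("balanced_accuracy:", balanced_accuracy)
print("mcc:", mcc)
print("per-class recall:", [round(r, 3) for r in recalls])
```

The per-class recalls show where the errors concentrate: rating 3 is recovered least often (150 of 424 true cases), and most of its misses land on the neighbouring ratings 2 and 4.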
