Setup/import steps:

In [0]:
import json
import os
# Mount Google Drive so that the dataset file can be read
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Loading the data used to train the model and converting it into the expected format (a list of features and a list of labels):

In [0]:
# Read the dataset; the context manager closes the file once it is parsed
with open("/content/drive/My Drive/AI ML Group/lr_dataset.json", "r") as f:
  res = json.load(f)
items = res['data']
#print (len(items))
# Collect titles (features) and categories (labels) as parallel lists
titles = list()
categories = list()
for item in items:
  titles.append(item['title'])
  categories.append(item['comprehensive_category'])

In [0]:
# Print the first 5 titles and categories as a sanity check
print (titles[:5])
print (categories[:5])

["Women's Caftan Dress - Xhilaration&#153; (Juniors') Jade L", "Women's Plus Size Denim Jacket  - Universal Thread&#153; White Wash X", "Vegas Golden Knights Women's Short Sleeve Heathered T-Shirt - L", "Girls' Rockergirl Studded Ballets - Stevies Black 1", '1/3 CT. T.W. Simulated Sapphire Trio Circle Necklace in Sterling Silver - Sapphire']
['dresses', 'jackets', 't-shirts', 'shoes', 'necklaces']
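
For readers without access to the dataset file, the layout that the loading cell expects can be sketched with a small in-memory sample. Only the key names ('data', 'title', 'comprehensive_category') come from the code above; the two records and the sample_* variable names are made up for illustration.

```python
import json

# Hypothetical two-record sample; only the key names mirror the real file.
sample_json = '''
{"data": [
  {"title": "Example denim jacket", "comprehensive_category": "jackets"},
  {"title": "Example summer dress", "comprehensive_category": "dresses"}
]}
'''

sample_items = json.loads(sample_json)["data"]

# Same extraction as the loading cell, written as list comprehensions.
sample_titles = [item["title"] for item in sample_items]
sample_categories = [item["comprehensive_category"] for item in sample_items]

print(sample_titles)
print(sample_categories)
```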


Actual application logic begins here.
The loaded data is split into train and test samples. The 'train' samples will be used to train the model, and the 'test' samples will be used to evaluate the accuracy of the model on data which it has not seen before.
The test size is taken as 25% of the available data, with a specific random seed so that the actual data split remains the same across different runs.

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(titles, categories, test_size=0.25, random_state=0)

In [0]:
print (len(x_train))
print (len(y_train))
print (len(x_test))
print (len(y_test))

24297
24297
8099
8099
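
As a small check of the claim that a fixed random seed keeps the split stable, the sketch below splits a toy list twice with the same random_state and compares the results. The toy data and the toy_*/split_* names are assumptions made for illustration; they are not part of the dataset.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for titles and categories (hypothetical data).
toy_x = [f"title {i}" for i in range(8)]
toy_y = [i % 2 for i in range(8)]

# Two independent calls with the same seed.
split_a = train_test_split(toy_x, toy_y, test_size=0.25, random_state=0)
split_b = train_test_split(toy_x, toy_y, test_size=0.25, random_state=0)

print(split_a == split_b)                # identical partitions
print(len(split_a[0]), len(split_a[1]))  # train size, test size
```

With 8 samples and test_size=0.25, the split should contain 6 training samples and 2 test samples.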


Our feature data (titles) are still in string form, and we need them to be numeric values so that a model can be trained. The CountVectorizer builds a vector representation of each string from the terms it contains. We set binary=True so that each feature records only whether a term is present in a title (0 or 1) rather than how many times it occurs.
We also configure the vectorizer to use word n-grams of up to 2 words. Any term occurring in over 90% of the documents is dropped (max_df is the maximum document frequency allowed).
We set strip_accents='ascii' so that accented characters are mapped to their ASCII equivalents and only the ASCII character set remains.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(strip_accents='ascii', analyzer='word', ngram_range=(1,2), max_df=0.9, binary=True)
cv.fit(x_train) #learning the vocabulary from the train set only

train_feature_set=cv.transform(x_train) #transforming titles to vectors for train set
test_feature_set=cv.transform(x_test) #transforming titles to vectors for test set
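
To make the effect of ngram_range=(1,2) and binary=True concrete, here is a minimal sketch on two made-up titles. The toy_* names and the titles are assumptions for illustration; max_df and strip_accents are left out because they have no visible effect on such a tiny example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical titles; "red" is repeated on purpose.
toy_titles = ["red red dress", "blue denim jacket"]

toy_cv = CountVectorizer(analyzer="word", ngram_range=(1, 2), binary=True)
toy_matrix = toy_cv.fit_transform(toy_titles)

# Vocabulary contains single words and two-word sequences.
toy_vocab = sorted(toy_cv.vocabulary_)
print(toy_vocab)

# Each row is one title; entries are 0/1 even though "red" occurs twice.
print(toy_matrix.toarray())
```

The vocabulary should include bigrams such as "red dress" and "denim jacket" next to the single words, and no entry of the matrix exceeds 1.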

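We then instantiate a LogisticRegression object which will be used to create our model. The 'balanced' class_weight compensates for class imbalance by weighting each class inversely to its frequency in the training samples. The 'lbfgs' solver selects the L-BFGS (limited-memory BFGS) optimization algorithm, which copes well with a dataset of this size. Setting multi_class to 'ovr' (one-vs-rest) fits a separate binary classifier for each category, in contrast to 'multinomial', which fits a single model over all categories. max_iter=1000 raises the iteration limit above the default of 100 so that the solver has room to converge.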

In [0]:
#generating the model
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(class_weight='balanced', solver='lbfgs', multi_class='ovr', max_iter=1000)

model = logisticRegr.fit(train_feature_set, y_train)
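
The one-vs-rest idea can be sketched on a toy dataset. Note that this sketch uses the OneVsRestClassifier wrapper rather than the multi_class='ovr' argument from the cell above, since the wrapper exposes the per-class binary models directly and the multi_class argument is deprecated in recent scikit-learn releases. The titles, labels, and toy_*/ovr names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical miniature dataset: two titles per category.
toy_titles = [
    "red summer dress", "floral maxi dress",
    "blue denim jacket", "faux fur jacket",
    "graphic cotton t-shirt", "striped sleeve t-shirt",
]
toy_labels = ["dresses", "dresses", "jackets", "jackets", "t-shirts", "t-shirts"]

toy_cv = CountVectorizer(binary=True)
toy_features = toy_cv.fit_transform(toy_titles)

# One binary logistic regression is fitted per category.
ovr = OneVsRestClassifier(
    LogisticRegression(class_weight="balanced", solver="lbfgs", max_iter=1000)
)
ovr.fit(toy_features, toy_labels)

print(list(ovr.classes_))
print(len(ovr.estimators_))  # number of binary classifiers
toy_prediction = ovr.predict(toy_cv.transform(["denim jacket"]))
print(toy_prediction)
```

At prediction time, the category whose binary classifier is most confident is returned.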

We can then pass test samples, which the model has not seen before, to predict their labels. Here we predict the labels for the first three test samples only.

In [0]:
#testing the model
predictions = model.predict(test_feature_set[:3])
result = (dict(zip(x_test[:3], predictions)))
for key in result:
  print ("%s: %s" % (key, result[key]))

Women's Plus Faux Fur Jacket - Ava & Viv&#153; Burgundy 2X: jackets
Women's Stranger Things&#174; Character Walking Raglan 3/4 Sleeve T-Shirt (Juniors') - Gray/Red L: t-shirts
18k Rose Gold Plated and Sterling Silver Diamond Accent Wing Pendant with 18" Chain: accessories


The 'score' method computes the mean accuracy on the given test data and labels for the model we have trained.

In [0]:
#Finding the score of the model to see how accurate it is
score = model.score(test_feature_set, y_test)
print(score)

0.9512285467341647
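
Mean accuracy is simply the fraction of samples whose predicted label matches the true label. The sketch below computes it by hand on made-up labels and compares the result with scikit-learn's accuracy_score; the toy_* lists are assumptions for illustration and are unrelated to the score printed above.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels; one of the four is wrong.
toy_true = ["dresses", "jackets", "t-shirts", "shoes"]
toy_pred = ["dresses", "jackets", "accessories", "shoes"]

# Fraction of positions where prediction and truth agree.
manual_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
library_accuracy = accuracy_score(toy_true, toy_pred)

print(manual_accuracy)
print(library_accuracy)
```

Because accuracy weights every sample equally, rare categories contribute little to the overall figure.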
