In [39]:
import itertools
import numpy as np
# from __future__ import unicode_literals

# Text Classification of food items (Unsupervised Learning, with improved accuracy)

In this notebook, I will attempt to classify the items of a food database into appropriate segments.

# Contents

1. Introduction
2. Text Pre-processing
3. Text-to-features (Feature Engineering)
4. Text modelling
5. Text Classification
6. Testing and exporting
7. Conclusion

# 1. Introduction

In the following problem, we have a database 'item_list.csv' having 2 columns, item_name and item_id. The tasks are:

1. To come up with appropriate segments for the food items
2. Train a model that predicts the segments

**This is what I did in my previous attempt:**

*"For the first part, I will use a clustering algorithm as the problem is unsupervised. For the second part, I will extract labels from the clusters and use them as features for clustering.*

*I will be using Word2Vec using gensim, as Word2Vec has the power to produce word embeddings. Other models like bag of words and tf-idf will not give us the correlation between the words.*

*As an alternative, I will also be using TextBlob with the NaiveBayesClassifier as it can give a better result than K-means using Scikit-Learn.*

*I wanted to use GloVe and fastText as well, so that we could have had an overview of all models and chosen the best one."*

**For a second attempt, I decided to use GloVe (which learns dense vector representations of words from global word-word co-occurrence statistics) and spaCy (a general-purpose NLP library whose English models include pre-trained vectors for the most common English words, trained with GloVe on Common Crawl).**

So, in short, I'll be making meta-labels (topic_keywords) based on my overview of the data.
Also, I'll be defining the labels (topic_labels) up front, before any vectors are computed.

Then, I'll convert each set of topic keywords to a vector using GloVe.
After that, I'll convert our data (the item names) into vectors as well.

Finally, I'll compute a similarity matrix of each item name to each topic and assign each item to its most similar topic, which gives us the output.

## 1.1 Sneak peek at the data

In [1]:
#Importing stuff
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
import nltk



In [2]:
import spacy
import en_core_web_md

In [3]:
# Import database
dataset=pd.read_csv('final_food.csv')

In [4]:
# Sneak peek at the data
dataset.head()

Unnamed: 0.1,Unnamed: 0,item_name,id,cuisine,type,sub_type
0,112,chicken noodles,6651,1,1,0
1,229,egg triple schezwan fried rice,9176,1,1,0
2,834,chicken schezwan lollipop,6018,1,1,0
3,837,chicken lollypop,14139,1,1,0
4,869,pudina tandoori momos (non-veg),4689,1,1,0


In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13337 entries, 0 to 13336
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  13337 non-null  int64 
 1   item_name   13337 non-null  object
 2   id          13337 non-null  int64 
 3   cuisine     13337 non-null  int64 
 4   type        13337 non-null  int64 
 5   sub_type    13337 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 625.3+ KB


I tried to keep the ids (instead of deleting them) in the following way:
    1. Making all required changes to the data without deleting the id
    2. Slicing the id column and storing it separately before modelling
    3. Appending the stored ids back to the original database
    
The issue with this was:
After making all changes to the data, there are fewer entries in the data, so the ids and the data could not be concatenated because their dimensions now differ.

This issue will be solved (#1)

In [6]:
# Drop the unnecessary columns.
dataset.drop(labels = ["Unnamed: 0"], axis = 1, inplace = True)

In [7]:
#dataset.drop(labels = ["id"], axis = 1, inplace = True)

In [8]:
dataset.values

array([['chicken noodles', 6651, 1, 1, 0],
       ['egg triple schezwan fried rice', 9176, 1, 1, 0],
       ['chicken schezwan lollipop', 6018, 1, 1, 0],
       ...,
       ['vodka - absolut d.f.', 11528, 6, 0, 5],
       ['vodka - absolut d.f.', 11639, 6, 0, 5],
       ['vodka - absolut d.f.', 11616, 6, 0, 5]], dtype=object)

In [9]:
# Brief information about dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13337 entries, 0 to 13336
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   item_name  13337 non-null  object
 1   id         13337 non-null  int64 
 2   cuisine    13337 non-null  int64 
 3   type       13337 non-null  int64 
 4   sub_type   13337 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 521.1+ KB


In [10]:
#Check for null values
dataset.isnull().sum()

item_name    0
id           0
cuisine      0
type         0
sub_type     0
dtype: int64

In [11]:
dataset.describe()

Unnamed: 0,id,cuisine,type,sub_type
count,13337.0,13337.0,13337.0,13337.0
mean,14588.598485,3.530329,0.055635,1.570368
std,7985.007552,1.611623,0.229224,2.05237
min,2.0,0.0,0.0,0.0
25%,7963.0,3.0,0.0,0.0
50%,15008.0,4.0,0.0,0.0
75%,21197.0,5.0,0.0,3.0
max,28924.0,6.0,1.0,8.0


In [12]:
dataset.head()

Unnamed: 0,item_name,id,cuisine,type,sub_type
0,chicken noodles,6651,1,1,0
1,egg triple schezwan fried rice,9176,1,1,0
2,chicken schezwan lollipop,6018,1,1,0
3,chicken lollypop,14139,1,1,0
4,pudina tandoori momos (non-veg),4689,1,1,0


Thus, in this section, we have had a good look at our data.

# 2. Text pre-processing

The dataset is raw and has many errors. We will implement four major changes in our dataset. They are:
    1. Convert all data to lowercase
    2. Get rid of the punctuation
    3. Remove numbers from dataset
    4. Remove specific elements having no significant use (delivery charges@30, gi--)
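
As a rough sketch of how the first three steps fit together, the helper below applies them to a single string with the standard `re` module. The function name `clean_item_name` and the example strings are my own illustrations, not part of the dataset code, and the final whitespace collapse is an extra step that the cells in this notebook do not perform.

```python
import re

def clean_item_name(name: str) -> str:
    """Lowercase, strip punctuation and digits, then collapse leftover spaces."""
    text = name.lower()                        # step 1: lowercase
    text = re.sub(r"[^\w\s]", "", text)        # step 2: drop punctuation
    text = re.sub(r"\d+", "", text)            # step 3: drop digits
    return re.sub(r"\s+", " ", text).strip()   # extra: tidy whitespace (assumption)

# Illustrative inputs only
examples = ["Pudina Tandoori Momos (Non-Veg)", "delivery charge@30", "vodka - absolut d.f."]
cleaned = [clean_item_name(e) for e in examples]
print(cleaned)
```

Note that after these steps an entry such as `delivery charge@30` no longer contains `@` or digits, so step 4 has to match the cleaned form of the text.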

## 2.1 Converting to lowercase

In [13]:
#Converting data to lower-case using Lambda function
a = dataset.apply(lambda x: x.astype(str).str.lower())

In [14]:
df1 = a['id']

In the above code, we have sliced the id column and stored it in df1 (a pandas Series), which is meant to be appended back to the main database later. The problem here was that, after removing the punctuation, the id gets deleted as well for some reason. Also, the number of rows is reduced by the cleaning, so the rows should be kept in sync with the id column before slicing it.

Will fix this issue (#1)

In [15]:
a.head()

Unnamed: 0,item_name,id,cuisine,type,sub_type
0,chicken noodles,6651,1,1,0
1,egg triple schezwan fried rice,9176,1,1,0
2,chicken schezwan lollipop,6018,1,1,0
3,chicken lollypop,14139,1,1,0
4,pudina tandoori momos (non-veg),4689,1,1,0


In [16]:
#new_a=a.drop('id',axis=1)

## 2.2 Removing punctuation

In [17]:
# Getting rid of all punctuation (raw string and regex=True so the pattern is treated as a regex)
b=a.apply(lambda x: x.astype(str).str.replace(r'[^\w\s]', '', regex=True))

  


In [18]:
b.shape

(13337, 5)

In [19]:
b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13337 entries, 0 to 13336
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   item_name  13337 non-null  object
 1   id         13337 non-null  object
 2   cuisine    13337 non-null  object
 3   type       13337 non-null  object
 4   sub_type   13337 non-null  object
dtypes: object(5)
memory usage: 521.1+ KB


## 2.3 Removing digits

In [20]:
# Remove runs of digits from the item names only
f=b['item_name'].str.replace(r'\d+', '', regex=True)


In [21]:
f

0                       chicken noodles
1        egg triple schezwan fried rice
2             chicken schezwan lollipop
3                      chicken lollypop
4          pudina tandoori momos nonveg
                      ...              
13332             vodka smirnoff orange
13333                    vodka belvedre
13334                 vodka  absolut df
13335                 vodka  absolut df
13336                 vodka  absolut df
Name: item_name, Length: 13337, dtype: object

In [22]:
# Converting a pandas Series to a NumPy array
c=f.values

In [23]:
c

array(['chicken noodles', 'egg triple schezwan fried rice',
       'chicken schezwan lollipop', ..., 'vodka  absolut df',
       'vodka  absolut df', 'vodka  absolut df'], dtype=object)

## 2.4 Removing specific words having no significance (delivery charge@30, gi--3557)

Here, since we do not know the index of these words in the database, we will use the np.argwhere() function.

The issues here are:
    1. I have defined a new array on every new line.
    2. The names of the defined arrays are confusing.
   
Both issues will be fixed (**#2**)
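
One possible way to address both #1 and #2, sketched on toy data: keep everything in a DataFrame and drop the unwanted rows with a single boolean mask, so the ids are filtered together with the names. The frame `items`, the list `junk` and the ids 101 and 102 are made up for illustration; the real junk strings would have to be written as they appear after cleaning.

```python
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical rows)
items = pd.DataFrame({
    "item_name": ["chicken noodles", "delivery charge", "gi", "egg fried rice"],
    "id": [6651, 101, 102, 9176],
})

# Hypothetical list of entries to discard, in their cleaned form
junk = ["delivery charge", "gi"]

# One mask removes the junk rows from item_name and id at the same time
kept = items[~items["item_name"].isin(junk)].reset_index(drop=True)
print(kept)
```

Because the names and ids never leave the same frame, there is no separate id array that can fall out of sync, and no chain of intermediate arrays to name.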

In [24]:
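# Find and delete entries equal to 'delivery charge@'.
# Punctuation and digits were already stripped in 2.2 and 2.3,
# so a literal containing '@' cannot match any cleaned name.
index = np.argwhere((c=='delivery charge@'))
h=np.delete(c, index)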

In [25]:
h

array(['chicken noodles', 'egg triple schezwan fried rice',
       'chicken schezwan lollipop', ..., 'vodka  absolut df',
       'vodka  absolut df', 'vodka  absolut df'], dtype=object)

In [26]:
#Find and delete index having word as gi
index = np.argwhere(h=='gi')
i=np.delete(h, index)

In [27]:
i

array(['chicken noodles', 'egg triple schezwan fried rice',
       'chicken schezwan lollipop', ..., 'vodka  absolut df',
       'vodka  absolut df', 'vodka  absolut df'], dtype=object)

In [28]:
d=i

In [29]:
d

array(['chicken noodles', 'egg triple schezwan fried rice',
       'chicken schezwan lollipop', ..., 'vodka  absolut df',
       'vodka  absolut df', 'vodka  absolut df'], dtype=object)

Thus, we have successfully cleaned our text data.

# 3. Feature Engineering 

Converting text data to vectors. I have taken the transpose of the clean data array and converted it to a list, as the model requires a list as its input.

In [30]:
# Convert to matrix
x = np.matrix(d)

In [31]:
# Take transpose of matrix
e=x.T

In [32]:
# Convert back to array
A = np.squeeze(np.asarray(e))

In [33]:
# Convert array to list
keywords=np.array(A).tolist()
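
As a side note, the matrix, transpose and squeeze steps above should give the same list as calling `tolist()` on the array directly. The small check below illustrates this on a made-up three-item array, using a `reshape` in place of `np.matrix`; it is a sketch, not a replacement for the cells above.

```python
import numpy as np

# Toy stand-in for the cleaned array d
d = np.array(["chicken noodles", "egg fried rice", "fish manchurian"], dtype=object)

# Round trip similar to the cells above: row vector -> transpose -> squeeze -> list
round_trip = np.squeeze(np.asarray(d.reshape(1, -1).T)).tolist()

# Direct conversion
direct = d.tolist()

print(round_trip == direct)
```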

In [34]:
keywords

['chicken noodles',
 'egg triple schezwan fried rice',
 'chicken schezwan lollipop',
 'chicken lollypop',
 'pudina tandoori momos nonveg',
 'chicken combination schezwan',
 'shanghai chicken noodles',
 'fried egg fried rice',
 'schezwan chicken noodles',
 'egg hakka noodles',
 'chilli garlic chicken noodles',
 'hakka noodles chicken',
 'butter chickenegg fried ricedrink  ml',
 'chicken steamed momo  pieces',
 'chicken triple schezwan fried rice',
 'hakka noodles  chicken half',
 'egg fried rice',
 'schezwan chicken dry',
 'phad thai noodles prawns',
 'burnt chilli egg fried rice',
 'fish manchurian',
 'chicken schezwan shawarma',
 'chicken shanghai noodles',
 'chicken lollipop',
 'steamed  chicken  prawn momo',
 'malaysian chicken noodles',
 'chicken schezwan rice',
 'chicken chilly and chicken fried rice or noodles',
 'chicken manchurian noodles',
 'singapore style fried rice non veg',
 'egg chilli garlic noodles',
 'chicken hakka noodles',
 'chicken garlic noodles',
 'egg schezwan no

In [35]:
# Number of items left after cleaning (the raw dataset had 13337 rows)
print(len(keywords))

13337


In [36]:
nlp = en_core_web_md.load()

I've considered these five labels to classify the data.

In [40]:
topic_labels = [
  'Veg',
  'Non-Veg',
  'Non-alcoholic beverages',
  'Alcoholic beverages',
  'Desserts'
]

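*After taking a good look at our raw data, I've picked several keywords which can be associated with each label. These will help us in getting a good accuracy score.*

They can be tweaked further to improve accuracy. For instance, misspelled keywords such as 'muttton' or 'shawrma' may have no pre-trained vector, in which case they add nothing useful to the topic vector.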

In [41]:
topic_keywords=[
    'veg vegetable paneer potato aloo dal cheese wrap',
    'chicken muttton fish prawn bacon pepperoni omellete shawrma',
    'milk tea coffee shake soup soft drinks cafe juice frappe cappuccino',
    'beer whisky alcohol vodka mojito',
    'mousse pancakes nutella waffles pastry choco chocolate brownie cake ice cream'
]

In [42]:
# Run each keyword string through the spaCy pipeline
topic_docs = list(nlp.pipe(topic_keywords, batch_size=10000))

In [43]:
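# One vector per topic (doc.vector averages the token vectors);
# use a fallback vector from the loaded vocab if a doc has none
topic_vectors = np.array([doc.vector 
  if doc.has_vector else nlp.vocab[0].vector
  for doc in topic_docs])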

In [44]:
# Print topic vector for our first label, Veg
print(topic_labels[0])
print(topic_vectors[0])

Veg
[-3.24378878e-01 -7.63103738e-02  3.97118747e-01  2.39687487e-02
  5.06347492e-02  7.03979969e-01 -1.82393640e-02 -2.02608362e-01
  7.28636086e-02  4.54423606e-01 -5.68820000e-01  2.47344241e-01
 -1.84499115e-01 -1.64249688e-01  2.48333752e-01 -1.51596755e-01
  1.57408506e-01  9.88685012e-01  2.48468995e-01  2.24234939e-01
  8.03420097e-02  4.71788719e-02 -2.19443738e-02  2.57407516e-01
  6.72015026e-02 -1.62577614e-01 -3.42857778e-01  3.39884222e-01
  1.92930490e-01 -6.35887504e-01 -1.47502497e-01  9.75522473e-02
  2.10239977e-01 -4.10250008e-01  2.41726354e-01  1.81529433e-01
  9.83289108e-02  1.07892379e-01 -7.45120049e-02  4.51712877e-01
  1.13021001e-01 -1.09365001e-01 -1.28557712e-01 -3.97226252e-02
 -8.30580071e-02  4.96325999e-01 -3.26073691e-02  4.77239996e-01
 -7.17169419e-02  3.41127515e-01  5.30605018e-02 -2.06014737e-01
  4.22744483e-01 -1.77046254e-01  3.37481260e-01 -4.00706261e-01
  1.29671007e-01  1.39504254e-01  1.78783610e-02  1.02693997e-01
  4.96412516e-02 -1.1

In [45]:
# View all topic vectors
topic_vectors

array([[-0.32437888, -0.07631037,  0.39711875, ..., -0.6509937 ,
         0.22559974,  0.20750675],
       [-0.23657374, -0.14360987,  0.27317825, ..., -0.43897274,
         0.11324563, -0.03115   ],
       [-0.08931038, -0.01475291,  0.21033691, ..., -0.5406088 ,
         0.14321029,  0.25229844],
       [-0.309226  ,  0.1686554 ,  0.21253319, ..., -0.538742  ,
         0.06441001,  0.18663299],
       [ 0.11771746, -0.0313131 ,  0.28859174, ..., -0.66667634,
        -0.16483046,  0.57205635]], dtype=float32)

In [46]:
print(topic_labels[2])
print(topic_vectors[2])

Non-alcoholic beverages
[-8.93103778e-02 -1.47529086e-02  2.10336909e-01 -1.01995640e-01
  7.77117489e-03  2.48992637e-01 -3.24770004e-01  8.50905180e-02
  3.13643813e-01  1.19282544e+00 -4.32006091e-01  4.52123322e-02
 -4.34504509e-01 -2.81927437e-01  2.65879631e-01 -2.26651192e-01
 -3.18650991e-01  1.20925820e+00  1.17705181e-01 -7.99868405e-02
 -1.27639100e-01 -2.38283817e-02  3.73309315e-03 -1.27673998e-01
  1.07386999e-01 -4.33483720e-01  9.65854526e-03 -7.78194591e-02
  1.78675458e-01 -6.85412705e-01  1.00215644e-01  4.90201600e-02
  9.74987000e-02 -2.19014227e-01 -4.72317301e-02  2.10159644e-01
  6.92238212e-02  6.19738139e-02 -2.22038016e-01  3.46211821e-01
  4.46129106e-02 -4.93921489e-02 -1.02479368e-01 -5.34951799e-02
 -1.87166467e-01  3.93513590e-01 -2.45290458e-01  6.65829107e-02
 -2.36766100e-01 -3.92292589e-02 -6.47206604e-02 -5.78493588e-02
 -1.90516904e-01  2.62800187e-01  3.77149850e-01 -5.84183633e-01
  1.46548217e-02 -1.66107595e-01 -1.44024268e-01 -1.11534543e-01
 

# 4. Modelling

We will now run our cleaned item names through the same pipeline.

In [47]:
# Run every cleaned item name through the spaCy pipeline
keyword_docs = list(nlp.pipe(keywords,
  batch_size=10000))

In [48]:
# One vector per item name, with the same fallback as for the topics
keyword_vectors = np.array([doc.vector
  if doc.has_vector else nlp.vocab[0].vector
  for doc in keyword_docs])

In [49]:
# Vector for our data
print(keywords[0])
print(keyword_vectors[0])

chicken noodles
[-4.22044992e-01 -5.14170006e-02  4.86104965e-01  8.67079943e-02
  2.39304990e-01  1.06842995e+00 -2.48064995e-01 -2.71620005e-01
  4.07595515e-01  6.92620516e-01 -6.94674969e-01  4.41139996e-01
 -5.95850050e-02 -2.21540004e-01  3.34649980e-01 -1.03672504e-01
  1.35318890e-01  1.09670997e+00 -8.11960027e-02  5.45265019e-01
 -1.64544992e-02  4.05960009e-02 -1.51843503e-01  8.50025043e-02
  2.04469506e-02 -3.07802975e-01 -4.65790004e-01 -1.94914997e-01
  1.26453996e-01 -9.12675023e-01  1.95215017e-01  2.30930001e-01
  2.32898995e-01 -6.45735025e-01  5.06935000e-01  2.44352996e-01
  9.85110030e-02  1.48250014e-02  1.93180099e-01  1.07679999e+00
  1.36370003e-01  4.94294986e-02 -3.17429975e-02 -6.46355003e-02
  1.24300502e-01  6.50319993e-01 -8.92419964e-02  4.45145011e-01
  1.04603499e-01  2.27221996e-01 -2.30295494e-01 -1.12717003e-01
  1.59869995e-02  1.94330007e-01  1.54199973e-02 -2.52979994e-01
  4.93389994e-01  2.89909989e-01  1.99524999e-01  6.79165006e-01
  1.51979

## 4.1 Computing the cosine similarity

We’ll compute a similarity matrix of each keyword to each topic. Cosine similarity has been shown to work well for word vector similarity, so we’ll compute cross-wise similarity and then assign each keyword to the topic it is most similar to.
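
To make the idea concrete, here is a minimal NumPy sketch of cosine similarity followed by the arg-max assignment, on made-up 2-dimensional vectors. The helper `cosine_similarity_matrix` and the toy arrays are illustrative assumptions; unlike the scikit-learn function used below, this sketch does not guard against all-zero vectors.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)  # scale rows to length 1
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T                                # dot products of unit vectors

# Toy "topic" and "item" vectors (hypothetical values)
toy_topics = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_items = np.array([[2.0, 0.5], [0.1, 3.0], [1.0, 1.2]])

sim = cosine_similarity_matrix(toy_items, toy_topics)  # shape: (items, topics)
assigned = sim.argmax(axis=1)                           # most similar topic per item
print(sim)
print(assigned)
```

Each row of `sim` holds one item's similarity to every topic, so taking the arg-max along `axis=1` picks the closest topic for that item.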

In [50]:
from sklearn.metrics.pairwise import cosine_similarity

In [51]:
simple_sim = cosine_similarity(keyword_vectors, topic_vectors)
topic_idx = simple_sim.argmax(axis=1)
print(simple_sim)

[[0.8404847  0.8566778  0.62187225 0.40534535 0.63998145]
 [0.82303536 0.8087372  0.68177664 0.465821   0.71547914]
 [0.7719281  0.8004707  0.7228648  0.5350555  0.7785764 ]
 ...
 [0.27824357 0.28585905 0.3543091  0.52943426 0.28443155]
 [0.27824357 0.28585905 0.3543091  0.52943426 0.28443155]
 [0.27824357 0.28585905 0.3543091  0.52943426 0.28443155]]


In [52]:
p=[]

# 5. Classifying

In [53]:
for k, i in zip(keywords, topic_idx):
  p.append((k, topic_labels[i]))

In [54]:
# np is already imported, so no separate import is needed
final=np.array(p)

In [55]:
final

array([['chicken noodles', 'Non-Veg'],
       ['egg triple schezwan fried rice', 'Veg'],
       ['chicken schezwan lollipop', 'Non-Veg'],
       ...,
       ['vodka  absolut df', 'Alcoholic beverages'],
       ['vodka  absolut df', 'Alcoholic beverages'],
       ['vodka  absolut df', 'Alcoholic beverages']], dtype='<U87')

This is the section where we could have concatenated the ids back onto the classified dataset.
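
A hedged sketch of what that could look like: if the ids have gone through exactly the same row filtering as the item names (so both have the same length and order), they can be placed in the result frame as one more column. All values below are toy stand-ins, and `topic_idx` is made up rather than taken from the model.

```python
import pandas as pd

# Toy stand-ins; in the notebook these would come from the earlier cells
ids = pd.Series([6651, 9176, 11528], name="id")
keywords = ["chicken noodles", "egg fried rice", "vodka absolut df"]
topic_labels = ["Veg", "Non-Veg", "Non-alcoholic beverages",
                "Alcoholic beverages", "Desserts"]
topic_idx = [1, 0, 3]  # hypothetical arg-max results

# This only works when ids and keywords are still aligned row by row
assert len(ids) == len(keywords) == len(topic_idx)

result = pd.DataFrame({
    "id": ids.to_numpy(),
    "Item Name": keywords,
    "Item Type": [topic_labels[i] for i in topic_idx],
})
print(result)
```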

In [56]:
#np.concatenate([final,df1])

# 6. Exporting final results

In [57]:
df = pd.DataFrame(final, columns = ['Item Name','Item Type'])

In [81]:
df.sample(15)

Unnamed: 0,Item Name,Item Type
2610,wrap rajma wrap with cheese,Veg
260,gobi noodles,Veg
12041,passion fruit smoothie,Non-alcoholic beverages
12748,tipsy whisky small,Alcoholic beverages
12533,veg clear soup,Veg
1772,garlic naan,Veg
855,malai kofta,Veg
11157,peach iced tea,Non-alcoholic beverages
9770,ganache well cake shake,Desserts
13290,wine grover chenin blanc art collection ml,Non-alcoholic beverages


In [69]:
df.to_csv('data.csv', index = False)

In [661]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27715 entries, 0 to 27714
Data columns (total 2 columns):
Item Name    27715 non-null object
Item Type    27715 non-null object
dtypes: object(2)
memory usage: 433.1+ KB


## 6.1 Viewing the results

In [662]:
df.head(30)

Unnamed: 0,Item Name,Item Type
0,peri peri wrap,Veg
1,milk chocolate tub,Non-alcoholic beverages
2,soft drinks large,Non-alcoholic beverages
3,cheese omellete,Non-Veg
4,cheesy dip,Veg
5,blueberry cream cheese sw waffles,Desserts
6,cafe mocha,Non-alcoholic beverages
7,exotic veg with sauce,Veg
8,basket chaat,Veg
9,red paprika half,Veg


As we can see, many of the items are correctly classified. There are still some problems with items whose names contain keywords from more than one category.
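
One way such borderline items could be flagged, sketched here as an idea rather than something this notebook does: compare the best and second-best similarity for each item and mark the item as ambiguous when the gap is small. The two rows below are copied from the similarity matrix printed in section 4.1; the `threshold` value is a hypothetical choice.

```python
import numpy as np

# Two rows from the similarity matrix in section 4.1
# ('chicken noodles' and 'vodka  absolut df')
sim = np.array([
    [0.8404847, 0.8566778, 0.62187225, 0.40534535, 0.63998145],
    [0.27824357, 0.28585905, 0.3543091, 0.52943426, 0.28443155],
])

# Sort each row and keep the two largest similarities
top_two = np.sort(sim, axis=1)[:, -2:]
margin = top_two[:, 1] - top_two[:, 0]   # best minus second best

threshold = 0.05                          # hypothetical cut-off
ambiguous = margin < threshold
print(margin, ambiguous)
```

Here 'chicken noodles' is nearly as close to Veg as to Non-Veg, so it is flagged, while the vodka item is clearly closest to Alcoholic beverages.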

# 7. Conclusion

I used a totally different approach this time, and the results are somewhat better than the previous ones.

Some issues still persist, and I will solve them:
    1. No data visualisations
    2. Getting an exact accuracy score
    3. Item_id issue

Almost all the pointers which you suggested have been taken into consideration:
    1. Meta-labelling 
    2. Removing text data like delivery charges, etc.
    3. Deleting the item_ids (not solved)
    
Due to your helpful pointers, we got better accuracy in this attempt.



**Thank you for reading**