# Text Classification of food items (Unsupervised Learning, with improved accuracy)

In this notebook, I will attempt to classify the items in a food database into appropriate segments.

# Contents

1. Introduction
2. Text Pre-processing
3. Text-to-features (Feature Engineering)
4. Text modelling
5. Text Classification
6. Testing and exporting
7. Conclusion

# 1. Introduction

In this problem, we have a database 'item_list.csv' with two relevant columns, item_name and id. The tasks are:

1. To come up with appropriate segments for the food items
2. To train a model that predicts the segments

**This is what I did in my previous attempt:**

*"For the first part, I will use a clustering algorithm as the problem is unsupervised. For the second part, I will extract labels from the clusters and use them as features for clustering.*

*I will be using Word2Vec via gensim, as Word2Vec can produce word embeddings. Other models like bag of words and tf-idf will not give us the correlation between the words.*

*As an alternative, I will also be using TextBlob with the NaiveBayesClassifier as it can give a better result than K-means using Scikit-Learn.*

*I wanted to use GloVe and fastText as well, so that we could have had an overview of all models and chosen the best one."*

**For this second attempt, I decided to use GloVe (which learns dense vector representations of words from global word-word co-occurrence statistics) and spaCy (a general-purpose NLP library that also ships pre-trained vectors for the most common English words, trained with GloVe on Common Crawl).**

So, in short, I'll define the labels (topic_labels) up front and build meta-labels (topic_keywords) for each of them, based on my overview of the data.

Then I'll convert each set of topic keywords to a vector using GloVe, and convert the item names into vectors in the same way.

Finally, I'll compute a similarity matrix between every item and every topic, and assign each item to its most similar topic, which gives us the output.

## 1.1 Sneak peek at data

In [587]:
#Importing stuff
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import nltk

In [588]:
import spacy
import en_core_web_md

In [589]:
# Import database
dataset=pd.read_csv('item_list.csv', encoding='ISO-8859-1')

In [590]:
# Sneak peek at the data
dataset.head()

|   | Unnamed: 0 | item_name | id |
|---|---|---|---|
| 0 | 0 | peri peri wrap | 2444 |
| 1 | 1 | gi-7161-19 | 24806 |
| 2 | 2 | Milk Chocolate Tub | 22729 |
| 3 | 3 | Soft Drinks Large | 12419 |
| 4 | 4 | Cheese Omellete | 3421 |


In [591]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28922 entries, 0 to 28921
Data columns (total 3 columns):
Unnamed: 0    28922 non-null int64
item_name     28922 non-null object
id            28922 non-null int64
dtypes: int64(2), object(1)
memory usage: 677.9+ KB


I tried to keep the ids attached to the items in the following way:

1. Make all required changes to the data without deleting the id
2. Slice the id column and store it separately before modelling
3. Append the id column back to the classified data

The issue with this approach: after cleaning, the data has fewer rows than the original, so the id column and the cleaned data can no longer be concatenated because their lengths differ.

This issue will be solved (#1).
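One possible way to keep the ids aligned is to clean the text inside the DataFrame and drop rows with a boolean mask, so that every removed item takes its id with it. The following is only a minimal sketch under assumptions: the toy rows, the `junk` set and the names `items`, `cleaned`, `keep` and `result` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for item_list.csv; the rows are illustrative only.
items = pd.DataFrame({
    "item_name": ["peri peri wrap", "gi-7161-19",
                  "Milk Chocolate Tub", "delivery charge@30"],
    "id": [2444, 24806, 22729, 101],
})

# Clean only the text column, leaving `id` untouched.
cleaned = (
    items["item_name"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)  # strip punctuation
    .str.replace(r"\d+", "", regex=True)      # strip digits
    .str.strip()
)

# Rows to keep; the junk values are an assumption for this sketch.
junk = {"gi", "delivery charge"}
keep = ~cleaned.isin(junk)

# Filtering the DataFrame (not a detached array) keeps each id with its item.
result = (
    items.loc[keep, ["id"]]
    .assign(item_name=cleaned[keep])
    .reset_index(drop=True)
)
print(result)
```

Because the mask is applied to the whole DataFrame, the row count of the text and the id column can never drift apart.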

In [592]:
# Drop the unnecessary columns.
dataset.drop(labels = ["Unnamed: 0"], axis = 1, inplace = True)

In [593]:
#dataset.drop(labels = ["id"], axis = 1, inplace = True)

In [594]:
dataset.values

array([['peri peri wrap', 2444],
       ['gi-7161-19', 24806],
       ['Milk Chocolate Tub', 22729],
       ...,
       ['Thums Up 200 Ml', 5774],
       ['Fruit Wine (Large)', 6561],
       ['coca cola 300 ml', 20649]], dtype=object)

In [595]:
# Brief information about dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28922 entries, 0 to 28921
Data columns (total 2 columns):
item_name    28922 non-null object
id           28922 non-null int64
dtypes: int64(1), object(1)
memory usage: 452.0+ KB


In [596]:
#Check for null values
dataset.isnull().sum()

item_name    0
id           0
dtype: int64

In [597]:
dataset.describe()

|   | id |
|---|---|
| count | 28922.0 |
| mean | 14463.499862 |
| std | 8349.206819 |
| min | 2.0 |
| 25% | 7233.25 |
| 50% | 14463.5 |
| 75% | 21693.75 |
| max | 28924.0 |


In [598]:
dataset.head()

|   | item_name | id |
|---|---|---|
| 0 | peri peri wrap | 2444 |
| 1 | gi-7161-19 | 24806 |
| 2 | Milk Chocolate Tub | 22729 |
| 3 | Soft Drinks Large | 12419 |
| 4 | Cheese Omellete | 3421 |


Thus, in this section, we have had a good look at our data.

# 2. Text pre-processing

The dataset is raw and has many errors. We will make four major changes to it:

1. Convert all data to lowercase
2. Get rid of the punctuation
3. Remove numbers from the dataset
4. Remove specific elements that carry no useful information (delivery charges@30, gi--)
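The four steps above can also be sketched as one small function on plain strings. This is an illustration using only the standard library, not the code the notebook runs; `clean_item_name`, `JUNK` and the sample names are hypothetical, and the whitespace collapsing at the end is an extra step the notebook does not perform.

```python
import re

# Hypothetical set of cleaned names that carry no useful information.
JUNK = {"gi", "delivery charges"}

def clean_item_name(text):
    """Lowercase, drop punctuation and digits, then tidy whitespace."""
    text = text.lower()                       # step 1: lowercase
    text = re.sub(r"[^\w\s]", "", text)       # step 2: punctuation
    text = re.sub(r"\d+", "", text)           # step 3: digits
    return re.sub(r"\s+", " ", text).strip()  # extra: collapse spaces

raw_names = ["Thums Up 200 Ml", "gi-7161-19", "delivery charges@30"]
cleaned_names = [clean_item_name(name) for name in raw_names]

# Step 4: drop the entries that are junk after cleaning.
kept_names = [name for name in cleaned_names if name not in JUNK]
print(kept_names)
```

Note that the junk check runs on the cleaned text, so the values in `JUNK` must not contain punctuation or digits.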

## 2.1 Converting to lowercase

In [599]:
#Converting data to lower-case using Lambda function
a = dataset.apply(lambda x: x.astype(str).str.lower())

In [600]:
df1 = a['id']

In the above code, we have sliced the id column and stored it in a pandas Series, df1, so that it can be appended back to the main dataset later. The problem is that the id is lost in section 2.3, where only the item_name column is carried forward, and the later deletions reduce the number of rows, so df1 no longer lines up with the cleaned data.

Will fix this issue (#1).

In [608]:
a.head()

|   | item_name | id |
|---|---|---|
| 0 | peri peri wrap | 2444 |
| 1 | gi-7161-19 | 24806 |
| 2 | milk chocolate tub | 22729 |
| 3 | soft drinks large | 12419 |
| 4 | cheese omellete | 3421 |


In [612]:
#new_a=a.drop('id',axis=1)

## 2.2 Removing punctuation

In [685]:
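# Getting rid of all punctuation
# (regex=True is passed explicitly: pandas 2.0+ treats the pattern as a literal string by default)
b = a.apply(lambda x: x.astype(str).str.replace(r'[^\w\s]', '', regex=True))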

In [701]:
b.shape

(28922, 2)

In [686]:
b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28922 entries, 0 to 28921
Data columns (total 2 columns):
item_name    28922 non-null object
id           28922 non-null object
dtypes: object(2)
memory usage: 452.0+ KB


## 2.3 Removing digits

In [687]:
# Remove digits from the item names (regex=True is needed in pandas 2.0+)
f = b['item_name'].str.replace(r'\d+', '', regex=True)

In [722]:
f

0                                        peri peri wrap
1                                                    gi
2                                    milk chocolate tub
3                                     soft drinks large
4                                       cheese omellete
5                                            cheesy dip
6                     blueberry cream cheese sw waffles
7                                            cafe mocha
8                                 exotic veg with sauce
9                                          basket chaat
10                                     red paprika half
11                                oreo obsessed pancake
12                                     raspberry medium
13                               chicken nuggets swiggy
14                                 chicken garlic salad
15                                    nuts over nutella
16                                  marshmallow brownie
17                                      delivery

In [712]:
# Converting a pandas dataframe to a numpy array
c=f.values

In [713]:
c

array(['peri peri wrap', 'gi', 'milk chocolate tub', ..., 'thums up  ml',
       'fruit wine large', 'coca cola  ml'], dtype=object)

## 2.4 Removing specific words having no significance (delivery charge@30, gi--3557)

Since we do not know in advance where these entries occur in the array, we use np.argwhere() to find their indices and np.delete() to remove them.

The issues here are:

1. A new array is defined at every step.
2. The names of the arrays are confusing.

Both issues will be fixed (**#2**).
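As a sketch of how both issues might be addressed, all unwanted values can be removed in a single pass with one boolean mask and descriptive names. The array contents and the names `names`, `junk_values`, `keep_mask` and `cleaned_names` are assumptions made for this example.

```python
import numpy as np

# Toy array standing in for the cleaned item names.
names = np.array(["peri peri wrap", "gi", "milk chocolate tub", "delivery charge"])

# Every value to discard, listed in one place (assumed for this sketch).
junk_values = ["gi", "delivery charge"]

# True where the name should be kept.
keep_mask = ~np.isin(names, junk_values)
cleaned_names = names[keep_mask]
print(cleaned_names)
```

The same `keep_mask` could be applied to any parallel array of equal length, which avoids creating a new intermediate array per removed value.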

In [715]:
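# Find and delete entries equal to 'delivery charge@'.
# Note: punctuation was already stripped in 2.2, so no entry contains '@' any more
# and this comparison matches nothing ('delivery charge' still appears in the keywords later).
index = np.argwhere(c == 'delivery charge@')
h = np.delete(c, index)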

In [723]:
h

array(['peri peri wrap', 'gi', 'milk chocolate tub', ..., 'thums up  ml',
       'fruit wine large', 'coca cola  ml'], dtype=object)

In [717]:
#Find and delete index having word as gi
index = np.argwhere(h=='gi')
i=np.delete(h, index)

In [718]:
i

array(['peri peri wrap', 'milk chocolate tub', 'soft drinks large', ...,
       'thums up  ml', 'fruit wine large', 'coca cola  ml'], dtype=object)

In [720]:
d=i

In [721]:
d

array(['peri peri wrap', 'milk chocolate tub', 'soft drinks large', ...,
       'thums up  ml', 'fruit wine large', 'coca cola  ml'], dtype=object)

Thus, we have successfully cleaned our text data.

# 3. Feature Engineering 

Here we prepare the cleaned text for conversion to vectors. I take the transpose of the clean data array and convert it to a list, because the model takes a list of strings as input.
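For a one-dimensional array, the matrix round trip and a direct `tolist()` call should give the same list. The short check below illustrates this on a toy array; `d_example`, `via_matrix` and `direct` are names invented for the example.

```python
import numpy as np

# Toy 1-D object array shaped like the cleaned data.
d_example = np.array(
    ["peri peri wrap", "milk chocolate tub", "soft drinks large"],
    dtype=object,
)

# Round trip: matrix -> transpose -> squeeze back to 1-D -> list.
via_matrix = np.squeeze(np.asarray(np.matrix(d_example).T)).tolist()

# Direct conversion of the 1-D array.
direct = d_example.tolist()

print(via_matrix == direct)
```

The direct call is shorter and also behaves sensibly for a single-element array, where `np.squeeze` would collapse the result to a scalar.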

In [634]:
# Convert to matrix
x = np.matrix(d)

In [635]:
# Take transpose of matrix
e=x.T

In [636]:
# Convert back to array
A = np.squeeze(np.asarray(e))

In [637]:
# Convert array to list
keywords=np.array(A).tolist()

In [638]:
keywords

['peri peri wrap',
 'milk chocolate tub',
 'soft drinks large',
 'cheese omellete',
 'cheesy dip',
 'blueberry cream cheese sw waffles',
 'cafe mocha',
 'exotic veg with sauce',
 'basket chaat',
 'red paprika half',
 'oreo obsessed pancake',
 'raspberry medium',
 'chicken nuggets swiggy',
 'chicken garlic salad',
 'nuts over nutella',
 'marshmallow brownie',
 'delivery charge',
 'schezwan paneer pizza waffle half',
 'espresso  cappuccino',
 'matka chicken',
 'pizza waffle full',
 'cottage cheese panino',
 'xtra honey',
 'banana  salted caramel puffle',
 'nasty nutella pancake',
 'whisky black  white',
 'bbq wrap',
 'veg fried rice',
 'feta greek',
 'penne creamy pesto',
 'choco chips',
 'white chocolate puffle',
 'delivery charge',
 'dark chocolate waffwich',
 'dark chocolate sauce',
 'chocolate chips white',
 'crispy chilli potato',
 'spicy corn kernel fr rice half',
 'butterscotch crunch',
 'nioxin derma renew',
 'dark chocolate mousse',
 'nutella puffle',
 'khushgrill chicken steak 

In [663]:
# Notice that many rows were removed in the cleaning process
print(len(keywords))

27715


In [639]:
nlp = en_core_web_md.load()

In [640]:
from __future__ import unicode_literals  # must be the first statement; only relevant on Python 2
import itertools
import numpy as np

I've considered these five labels to classify the data.

In [641]:
topic_labels = [
  'Veg',
  'Non-Veg',
  'Non-alcoholic beverages',
  'Alcoholic beverages',
  'Desserts'
]

*After taking a good look at the raw data, I picked several keywords that can be associated with each label. These should help the items match the right label.*

The keywords can be tweaked further to improve accuracy, for example by correcting misspellings such as 'muttton' and 'shawrma'.

In [642]:
topic_keywords=[
    'veg vegetable paneer potato aloo dal cheese wrap',
    'chicken muttton fish prawn bacon pepperoni omellete shawrma',
    'milk tea coffee shake soup soft drinks cafe juice frappe cappuccino',
    'beer whisky alcohol vodka mojito',
    'mousse pancakes nutella waffles pastry choco chocolate brownie cake ice cream'
]

In [643]:
# n_threads is omitted: it is deprecated since spaCy 2.1 and not accepted by spaCy 3
topic_docs = list(nlp.pipe(topic_keywords, batch_size=10000))

In [644]:
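# Use a zero vector of the model's vector width if a doc has no word vectors
topic_vectors = np.array([doc.vector
  if doc.has_vector else np.zeros(nlp.vocab.vectors_length, dtype='float32')
  for doc in topic_docs])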
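By default, spaCy's `doc.vector` is the average of the token vectors in the document. The toy sketch below imitates that averaging with made-up three-dimensional embeddings to show what an unknown keyword does to a topic vector. Everything in it (`toy_embeddings`, `average_vector`, `cosine`, the numbers) is hypothetical, and treating unknown tokens as zero vectors is an assumption of the sketch.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real vectors are far wider.
toy_embeddings = {
    "chicken": np.array([0.9, 0.1, 0.0]),
    "fish": np.array([0.7, 0.3, 0.0]),
}
DIM = 3

def average_vector(text):
    """Average the token vectors, using zeros for unknown tokens."""
    vectors = [toy_embeddings.get(tok, np.zeros(DIM)) for tok in text.split()]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

known_only = average_vector("chicken fish")
with_typo = average_vector("chicken fish muttton")  # third token is unknown

print(known_only, with_typo, cosine(known_only, with_typo))
```

In this toy setting the unknown token only shrinks the average without changing its direction, so it adds no signal to the topic; a correctly spelled keyword with a real vector would.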

In [729]:
# Print topic vector for our first label, Veg
print(topic_labels[0])
print(topic_vectors[0])

Veg
[-3.24378878e-01 -7.63103738e-02  3.97118747e-01  2.39687487e-02
  5.06347492e-02  7.03979969e-01 -1.82393640e-02 -2.02608362e-01
  7.28636086e-02  4.54423606e-01 -5.68820000e-01  2.47344241e-01
 -1.84499115e-01 -1.64249688e-01  2.48333752e-01 -1.51596755e-01
  1.57408506e-01  9.88685012e-01  2.48468995e-01  2.24234939e-01
  8.03420097e-02  4.71788719e-02 -2.19443738e-02  2.57407516e-01
  6.72015026e-02 -1.62577614e-01 -3.42857778e-01  3.39884222e-01
  1.92930490e-01 -6.35887504e-01 -1.47502497e-01  9.75522473e-02
  2.10239977e-01 -4.10250008e-01  2.41726354e-01  1.81529433e-01
  9.83289108e-02  1.07892379e-01 -7.45120049e-02  4.51712877e-01
  1.13021001e-01 -1.09365001e-01 -1.28557712e-01 -3.97226252e-02
 -8.30580071e-02  4.96325999e-01 -3.26073691e-02  4.77239996e-01
 -7.17169419e-02  3.41127515e-01  5.30605018e-02 -2.06014737e-01
  4.22744483e-01 -1.77046254e-01  3.37481260e-01 -4.00706261e-01
  1.29671007e-01  1.39504254e-01  1.78783610e-02  1.02693997e-01
  4.96412516e-02 -1.1

In [645]:
# View all topic vectors
topic_vectors

array([[-0.32437888, -0.07631037,  0.39711875, ..., -0.6509937 ,
         0.22559974,  0.20750675],
       [-0.23657374, -0.14360987,  0.27317825, ..., -0.43897274,
         0.11324563, -0.03115   ],
       [-0.08931038, -0.01475291,  0.21033691, ..., -0.5406088 ,
         0.14321029,  0.25229844],
       [-0.309226  ,  0.1686554 ,  0.21253319, ..., -0.538742  ,
         0.06441001,  0.18663299],
       [ 0.11771746, -0.0313131 ,  0.28859174, ..., -0.66667634,
        -0.16483046,  0.57205635]], dtype=float32)

In [646]:
print(topic_labels[2])
print(topic_vectors[2])

Non-alcoholic beverages
[-8.93103778e-02 -1.47529086e-02  2.10336909e-01 -1.01995640e-01
  7.77117489e-03  2.48992637e-01 -3.24770004e-01  8.50905180e-02
  3.13643813e-01  1.19282544e+00 -4.32006091e-01  4.52123322e-02
 -4.34504509e-01 -2.81927437e-01  2.65879631e-01 -2.26651192e-01
 -3.18650991e-01  1.20925820e+00  1.17705181e-01 -7.99868405e-02
 -1.27639100e-01 -2.38283817e-02  3.73309315e-03 -1.27673998e-01
  1.07386999e-01 -4.33483720e-01  9.65854526e-03 -7.78194591e-02
  1.78675458e-01 -6.85412705e-01  1.00215644e-01  4.90201600e-02
  9.74987000e-02 -2.19014227e-01 -4.72317301e-02  2.10159644e-01
  6.92238212e-02  6.19738139e-02 -2.22038016e-01  3.46211821e-01
  4.46129106e-02 -4.93921489e-02 -1.02479368e-01 -5.34951799e-02
 -1.87166467e-01  3.93513590e-01 -2.45290458e-01  6.65829107e-02
 -2.36766100e-01 -3.92292589e-02 -6.47206604e-02 -5.78493588e-02
 -1.90516904e-01  2.62800187e-01  3.77149850e-01 -5.84183633e-01
  1.46548217e-02 -1.66107595e-01 -1.44024268e-01 -1.11534543e-01
 

#  4. Modelling

Here we pass the cleaned item names through the same spaCy pipeline to obtain one vector per item.

In [647]:
# n_threads is omitted: it is deprecated since spaCy 2.1 and not accepted by spaCy 3
keyword_docs = list(nlp.pipe(keywords,
  batch_size=10000))

In [648]:
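# Use a zero vector of the model's vector width if a doc has no word vectors
keyword_vectors = np.array([doc.vector
  if doc.has_vector else np.zeros(nlp.vocab.vectors_length, dtype='float32')
  for doc in keyword_docs])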

In [649]:
# Vector for our data
print(keywords[0])
print(keyword_vectors[0])

peri peri wrap
[ 2.59444684e-01 -1.00313336e-01  2.74570018e-01  3.72499317e-01
  9.14593339e-02  1.02089994e-01  2.97270030e-01  2.53864348e-01
  2.00463012e-01 -4.94766645e-02 -1.08347327e-01 -3.33615333e-01
  4.83299971e-01 -1.47579998e-01 -2.52698004e-01 -6.33096620e-02
 -1.93663314e-01  6.65633380e-01 -8.48433375e-02 -2.49800030e-02
  2.60699987e-02  2.33166609e-02 -6.48333356e-02  2.04908013e-01
  2.67069995e-01 -1.19893335e-01 -3.76466662e-01 -1.29619995e-02
  1.34786665e-01  2.52466649e-01 -1.88200027e-02  8.96866620e-03
 -3.32243323e-01 -2.12303340e-01 -1.95200052e-02  2.87405010e-02
  1.80900678e-01 -9.51429978e-02 -8.33433270e-02  1.45624325e-01
  1.12314664e-01  1.67680010e-01 -4.04880010e-02 -1.77794680e-01
 -2.29870006e-01 -2.37937346e-01 -3.18850666e-01 -4.10000468e-03
 -2.21501186e-01  8.62626731e-03  1.15657665e-01 -8.05592686e-02
  4.10733335e-02 -5.39063334e-01  3.84093314e-01 -2.74853315e-02
  8.64799786e-03  4.27119970e-01 -7.34600052e-02  8.45173374e-02
 -4.076032

## 4.1 Computing the cosine similarity

We’ll compute a similarity matrix of each keyword to each topic. Cosine similarity has been shown to work well for word vector similarity, so we’ll compute cross-wise similarity and then assign each keyword to the topic it is most similar to.
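To make the computation concrete: the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below implements the cross-wise version with plain NumPy on made-up two-dimensional vectors. `cosine_similarity_matrix`, `toy_topics` and `toy_items` are illustrative names, and the function assumes no row is all zeros.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b.

    Assumes no row is all zeros (that would divide by zero).
    """
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Hypothetical 2-D vectors: two topics and three items.
toy_topics = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_items = np.array([[2.0, 0.1], [0.2, 3.0], [1.0, 0.5]])

sim = cosine_similarity_matrix(toy_items, toy_topics)  # shape: items x topics
assigned = sim.argmax(axis=1)                          # best topic per item
print(sim)
print(assigned)
```

Each row of `sim` belongs to one item and each column to one topic, so `argmax(axis=1)` picks the most similar topic per item.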

In [650]:
from sklearn.metrics.pairwise import cosine_similarity

In [651]:
simple_sim = cosine_similarity(keyword_vectors, topic_vectors)
topic_idx = simple_sim.argmax(axis=1)
print(simple_sim)

[[0.34263933 0.2565931  0.3108404  0.1452301  0.25741756]
 [0.6251325  0.56232715 0.825355   0.5810569  0.80771744]
 [0.52496713 0.46872646 0.7803726  0.614789   0.58035344]
 ...
 [0.4013421  0.36484957 0.51594746 0.42516547 0.4331894 ]
 [0.6017383  0.54553306 0.7253492  0.63016653 0.5587209 ]
 [0.42670894 0.36803028 0.6208189  0.63371706 0.46583572]]


In [652]:
p=[]

# 5. Classifying

In [653]:
for k, i in zip(keywords, topic_idx):
  p.append((k, topic_labels[i]))

In [655]:
from numpy import array
final=array(p)

In [656]:
final

array([['peri peri wrap', 'Veg'],
       ['milk chocolate tub', 'Non-alcoholic beverages'],
       ['soft drinks large', 'Non-alcoholic beverages'],
       ...,
       ['thums up  ml', 'Non-alcoholic beverages'],
       ['fruit wine large', 'Non-alcoholic beverages'],
       ['coca cola  ml', 'Alcoholic beverages']], dtype='<U87')

This is where the ids could have been concatenated back onto the classified dataset. It does not work yet, because df1 still has all 28922 rows while the classified data has only 27715.

In [725]:
#np.concatenate([final,df1])

# 6. Exporting final results

In [658]:
df = pd.DataFrame(final, columns = ['Item Name','Item Type'])

In [660]:
df.to_csv('data.csv', index = False)

In [661]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27715 entries, 0 to 27714
Data columns (total 2 columns):
Item Name    27715 non-null object
Item Type    27715 non-null object
dtypes: object(2)
memory usage: 433.1+ KB


## 6.1 Viewing the results

In [662]:
df.head(30)

|   | Item Name | Item Type |
|---|---|---|
| 0 | peri peri wrap | Veg |
| 1 | milk chocolate tub | Non-alcoholic beverages |
| 2 | soft drinks large | Non-alcoholic beverages |
| 3 | cheese omellete | Non-Veg |
| 4 | cheesy dip | Veg |
| 5 | blueberry cream cheese sw waffles | Desserts |
| 6 | cafe mocha | Non-alcoholic beverages |
| 7 | exotic veg with sauce | Veg |
| 8 | basket chaat | Veg |
| 9 | red paprika half | Veg |


As we can see, many of the items are classified correctly. Some problems remain, mainly with items whose names contain keywords from more than one category.
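One way such borderline items might be surfaced is to look at the gap between the best and second-best similarity in each row. The sketch below uses the first three rows of the similarity matrix printed in section 4.1, rounded to two decimals; the 0.05 threshold and the names `toy_sim`, `margin` and `ambiguous` are assumptions.

```python
import numpy as np

# First three rows of the similarity matrix from section 4.1, rounded.
toy_sim = np.array([
    [0.34, 0.26, 0.31, 0.15, 0.26],
    [0.63, 0.56, 0.83, 0.58, 0.81],
    [0.52, 0.47, 0.78, 0.61, 0.58],
])

# Sort each row so the last two columns hold the runner-up and the best score.
sorted_sim = np.sort(toy_sim, axis=1)
margin = sorted_sim[:, -1] - sorted_sim[:, -2]

# The 0.05 threshold is arbitrary and would need tuning on real data.
ambiguous = margin < 0.05
print(margin, ambiguous)
```

Items flagged this way could be reviewed by hand or used to decide which topic keywords need refining.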

# 7. Conclusion

I used a totally different approach this time, and the results are somewhat better than the previous ones.

There are still some open issues, which I will address:

1. No data visualisations
2. Getting an exact accuracy score
3. The item_id issue
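For the accuracy score, one simple option would be to label a small sample of items by hand and compare those labels with the predictions. The sketch below shows only the arithmetic: the hand labels are invented for illustration (they are not a measured result), while the predictions are copied from the results table in section 6.1.

```python
# Hypothetical hand-labelled sample: (item, label a human might assign).
hand_labelled = [
    ("peri peri wrap", "Veg"),
    ("cafe mocha", "Non-alcoholic beverages"),
    ("cheese omellete", "Non-Veg"),
    ("milk chocolate tub", "Desserts"),
]

# Predictions for the same items, copied from the results table.
predicted = {
    "peri peri wrap": "Veg",
    "cafe mocha": "Non-alcoholic beverages",
    "cheese omellete": "Non-Veg",
    "milk chocolate tub": "Non-alcoholic beverages",
}

# Accuracy = share of sampled items whose prediction matches the hand label.
correct = sum(predicted[item] == label for item, label in hand_labelled)
accuracy = correct / len(hand_labelled)
print(f"accuracy on the labelled sample: {accuracy:.2f}")
```

A real estimate would need a larger, randomly drawn sample than these four items.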

Almost all the pointers you suggested have been taken into consideration:

1. Meta-labelling
2. Removing text data like delivery charges, etc.
3. Not deleting the item_ids (not solved)

Thanks to your helpful pointers, the results improved in this attempt.



**Thank you for reading**