## Text Classification on an E-commerce Dataset Using TF-IDF Vectorization
## For classification we use three algorithms
* KNN
* Random Forest 
* Naive Bayes 
### We compare the performance of all three classifiers.
### We also compare performance on preprocessed and unprocessed text.
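
To make the vectorization step concrete, here is a minimal pure-Python sketch of how TF-IDF weights can be computed on a toy corpus. It assumes the smoothed IDF formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is what scikit-learn documents as the default for `TfidfVectorizer`. The three toy documents and the helper name `tfidf_vectors` are made up for illustration, and tokenization is simplified to lowercase whitespace splitting.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors) for toy docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed to mirror scikit-learn's default smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vocab, vectors


# Hypothetical toy corpus, not taken from the dataset.
docs = [
    "wooden study chair",
    "wooden decorative box",
    "silk saree",
]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:12s} {weight:.3f}")
```

In the first toy document, `wooden` also occurs in the second document, so it receives a lower weight than `study` and `chair`, which occur in that document only.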

In [36]:
import pandas as pd
df=pd.read_csv("Ecommerce_data.csv")
print(df.shape)

(24000, 2)


In [37]:
df.head()

Unnamed: 0,Text,label
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories


In [40]:
df.label.value_counts()

Household                 6000
Electronics               6000
Clothing & Accessories    6000
Books                     6000
Name: label, dtype: int64

In [43]:
df['label_num']=df.label.map({
    'Household':0,
    'Books':1,
    'Electronics':2,
    'Clothing & Accessories' :3
})

In [44]:
df.head()

Unnamed: 0,Text,label,label_num
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,2
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,3
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,3


In [47]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.Text,
    df.label_num,
    test_size=0.2,
    random_state=2023,
    stratify=df.label_num,
)

In [48]:
# Print the shapes of the train and test splits
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)

Shape of X_train: (19200,)
Shape of X_test: (4800,)


In [50]:
y_train.value_counts()

2    4800
1    4800
0    4800
3    4800
Name: label_num, dtype: int64
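
The sizes above follow from the split arithmetic. As a quick sanity check, assuming the 6,000 rows per class reported by `df.label.value_counts()` and the `test_size=0.2` passed to `train_test_split` (the variable names here are illustrative):

```python
# Expected sizes for a stratified 80/20 split of four balanced classes.
rows_per_class = 6000  # from df.label.value_counts()
n_classes = 4
test_size = 0.2

total = rows_per_class * n_classes
n_test = round(total * test_size)
n_train = total - n_test
# Stratification keeps the classes balanced inside the training split.
train_per_class = n_train // n_classes

print(n_train, n_test, train_per_class)
```

This matches the 19,200 training rows, 4,800 test rows, and 4,800 training rows per class shown in the outputs.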

## We will train several classifiers from sklearn

### KNN Classifier 

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer_tfidf',TfidfVectorizer()),
    ('KNN',KNeighborsClassifier())
])
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.96      1200
           1       0.97      0.95      0.96      1200
           2       0.96      0.96      0.96      1200
           3       0.97      0.98      0.97      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



In [69]:
# Spot-check the first five test samples and their true labels
print(X_test[:5])
print(y_test[:5])

17076    U.S. Polo Assn. Men's Cotton Boxers Feel at ea...
4457     Tribal India by Nadeem Hasnain Product Conditi...
431      Gone: Jack Caffery series 5 Review "Lacerating...
1272     Revise Anatomy in 15 Days (New SARP Series for...
17092    CalCore Luxury Comfort Orthopedic Memory Foam ...
Name: Text, dtype: object
17076    3
4457     1
431      2
1272     1
17092    0
Name: label_num, dtype: int64


In [70]:
y_pred[:5]

array([3, 1, 1, 1, 0], dtype=int64)

#### 4 of the 5 predictions are correct
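
The manual comparison above can also be done in code. This sketch hard-codes the five true labels and five predictions printed above and inverts the `label_num` mapping so that category names are shown; the variable names are illustrative.

```python
# Inverse of the label -> label_num mapping defined earlier.
label_names = {
    0: 'Household',
    1: 'Books',
    2: 'Electronics',
    3: 'Clothing & Accessories',
}

# First five true labels and predictions, copied from the outputs above.
y_true_sample = [3, 1, 2, 1, 0]
y_pred_sample = [3, 1, 1, 1, 0]

n_correct = 0
for true, pred in zip(y_true_sample, y_pred_sample):
    match = true == pred
    n_correct += int(match)
    print(f"true={label_names[true]:<24} pred={label_names[pred]:<24} match={match}")

print(f"{n_correct}/{len(y_true_sample)} correct")
```

The only mismatch is the third sample, whose true label is 2 (Electronics) and whose prediction is 1 (Books).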

## Multinomial Naive Bayes Classifier 

In [73]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer_tfidf',TfidfVectorizer()),
    ('Multinomial',MultinomialNB())
])
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1200
           1       0.98      0.93      0.95      1200
           2       0.97      0.96      0.97      1200
           3       0.97      0.98      0.98      1200

    accuracy                           0.96      4800
   macro avg       0.96      0.96      0.96      4800
weighted avg       0.96      0.96      0.96      4800



## Random Forest Classifier 

In [75]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer_tfidf',TfidfVectorizer()),
    ('random_forest',RandomForestClassifier())
])
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.96      0.97      0.96      1200
           1       0.98      0.98      0.98      1200
           2       0.98      0.97      0.98      1200
           3       0.98      0.98      0.98      1200

    accuracy                           0.97      4800
   macro avg       0.97      0.97      0.97      4800
weighted avg       0.97      0.97      0.97      4800



In [76]:
import spacy
nlp = spacy.load("en_core_web_sm")
def preprocess(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:  # skip stop words and punctuation
            continue
        filtered_tokens.append(token.lemma_)  # keep the lemma of each remaining token
    return " ".join(filtered_tokens)

In [78]:
df['preprocessed_text']=df['Text'].apply(preprocess)
# Series.apply, like Series.map, calls the function on each element

In [79]:
df.head()

Unnamed: 0,Text,label,label_num,preprocessed_text
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household,0,Urban Ladder Eisner Low Study Office Computer ...
1,"Contrast living Wooden Decorative Box,Painted ...",Household,0,contrast live Wooden Decorative Box Painted Bo...
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics,2,IO Crest SY pci40010 pci RAID Host Controller ...
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories,3,ISAKAA Baby Socks bear 8 Years- Pack 4 6 8 12 ...
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories,3,Indira Designer Women Art Mysore Silk Saree Bl...


In [80]:
df.preprocessed_text[0]

'Urban Ladder Eisner Low Study Office Computer chair(black study simple Eisner study chair firm foam cushion make long hour desk comfortable flexible mesh design air circulation support lean curved arm provide ergonomic forearm support adjust height gas lift find comfortable position nylon castor easy space chrome leg refer image dimension detail assembly require UL team time delivery indoor use'

## Applying Random Forest to preprocessed data

In [82]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.preprocessed_text,
    df.label_num,
    test_size=0.2,
    random_state=2023,
    stratify=df.label_num,
)

In [84]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer_tfidf',TfidfVectorizer()),
    ('random_forest',RandomForestClassifier())
])
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1200
           1       0.98      0.98      0.98      1200
           2       0.98      0.98      0.98      1200
           3       0.98      0.99      0.98      1200

    accuracy                           0.98      4800
   macro avg       0.98      0.98      0.98      4800
weighted avg       0.98      0.98      0.98      4800



### We find that preprocessed text performs slightly better than raw text (0.98 vs. 0.97 accuracy with Random Forest)

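## Conclusion 
### Preprocessed text performs slightly better than raw text (0.98 vs. 0.97 accuracy with Random Forest).
### The performance of the different classifiers (KNN, Naive Bayes, and Random Forest) on the raw text of this dataset is as follows.
#### Random Forest > Naive Bayes > KNN, although Naive Bayes and KNN both round to 0.96 accuracy (Random Forest: 0.97).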
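
Since the three pipelines differ only in their final estimator, the comparison can be sketched as a single loop. The snippet below is a minimal illustration on a made-up toy corpus, not the notebook's actual run. The helper name `compare_classifiers`, the toy texts, `n_neighbors=3` (chosen because the toy training set is tiny), and the fixed `random_state` are all assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


def compare_classifiers(X_train, y_train, X_test, y_test):
    """Fit one TF-IDF pipeline per classifier and return name -> accuracy."""
    classifiers = {
        'KNN': KNeighborsClassifier(n_neighbors=3),
        'Naive Bayes': MultinomialNB(),
        'Random Forest': RandomForestClassifier(random_state=2023),
    }
    scores = {}
    for name, estimator in classifiers.items():
        pipe = Pipeline([
            ('vectorizer_tfidf', TfidfVectorizer()),
            ('classifier', estimator),
        ])
        pipe.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipe.predict(X_test))
    return scores


# Hypothetical toy data: 0 = Household, 1 = Books.
toy_train = [
    "wooden study chair with foam cushion",
    "decorative wooden box for living room",
    "memory foam pillow for bed",
    "steel kitchen storage rack",
    "crime thriller novel paperback",
    "anatomy revision book for students",
    "history of tribal india book",
    "fantasy novel series box set",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_test = ["wooden kitchen chair", "thriller novel book"]
toy_test_labels = [0, 1]

scores = compare_classifiers(toy_train, toy_labels, toy_test, toy_test_labels)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

On the notebook's data, the same function would be called with the `X_train`, `y_train`, `X_test`, and `y_test` returned by `train_test_split`, once for the raw text and once for the preprocessed text.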