# Text Classification

## Dataset Gathering

### Amazon review dataset for clothing, shoes and jwellery
http://jmcauley.ucsd.edu/data/amazon/

In [23]:
import pandas as pd
import string

input_data = pd.read_csv("reviews_Clothing_Shoes_and_Jewelry_5.csv")
input_data['overall'] = input_data['overall'].astype(object) # fix datatype error
input_data['reviewText'] = input_data['reviewText'].astype(object) # fix datatype error

dataset = {"reviewText": input_data["reviewText"], "overall": input_data["overall"]  }
dataset = pd.DataFrame(data = dataset)
dataset = dataset.dropna()

dataset.head(5)

Unnamed: 0,reviewText,overall
0,This is a great tutu and at a really great pri...,5.0
1,I bought this for my 4 yr old daughter for dan...,5.0
2,What can I say... my daughters have it in oran...,5.0
3,"We bought several tutus at once, and they are ...",5.0
4,Thank you Halo Heaven great product for Little...,5.0


### Labelling data positive (1 if rating is 4, 5) / negative (-1 if rating is 1,2) based on overall ratings
### Ignoring rating 3 since they are neutral

In [24]:
dataset = dataset[dataset["overall"] != '3'] # need datatype=object
dataset["label"] = dataset["overall"].apply(lambda rating : +1 if str(rating) > '3' else -1)
dataset.head(5)

Unnamed: 0,reviewText,overall,label
0,This is a great tutu and at a really great pri...,5.0,1
1,I bought this for my 4 yr old daughter for dan...,5.0,1
2,What can I say... my daughters have it in oran...,5.0,1
3,"We bought several tutus at once, and they are ...",5.0,1
4,Thank you Halo Heaven great product for Little...,5.0,1


### Test & Train split

In [34]:
from sklearn.model_selection import train_test_split

X = pd.DataFrame(dataset, columns = ["reviewText"])
y = pd.DataFrame(dataset, columns = ["label"])

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=50)

## Text Processing

### Tokenize, Stem and lemmatize
### Bag of words model

In [66]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b', stop_words="english")
# vectorizer = CountVectorizer(token_pattern=r'\b\w+\b', analyzer='word', ngram_range=(2, 2))
train_vector = vectorizer.fit_transform(train_X["reviewText"])
test_vector = vectorizer.transform(test_X["reviewText"])
# print(train_vector, train_vector)

## Logistic Regression for classification

In [62]:
from sklearn.linear_model import LogisticRegression

clr = LogisticRegression()
clr.fit(train_vector, train_y.values.ravel())
scores = clr.score(test_vector, test_y)
print(scores)

0.9345285943959577


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


## Naive Bayers for Classification

In [64]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(train_vector, train_y.values.ravel())
scores = mnb.score(test_vector, test_y)
print(scores)

0.9162408130454754


## Final analysis of scores on different vectorization methods

| Count Vectorizer Constrains    | Time taken to process | Logistic Regression Score | Naive Bayers Score |
|--------------------------------|-----------------------|---------------------------|--------------------|
| tf-idf weighted                | 21 secs               | 0.93                      | 0.91               |
| with english stop words        | 20 secs               | 0.92                      | 0.92               |
| with word analyzer and bi-gram | 1.36 minutes          | 0.93                      | 0.91               |