# FIT5149 S2 2019 Assessment : Sentiment Classification for Product Reviews

Group information
- Group No: 52

Programming Language: Python in Jupyter Notebook

## 1. Introduction

#### In this data analysis challenge, we develop an automatic sentiment classification system that uses machine learning techniques to learn from a large set of product reviews provided by Yelp.
#### We consider five levels of opinion polarity: strong negative, weak negative, neutral, weak positive, and strong positive.
#### The aim is to build a sentiment classifier that assigns a large set of product reviews to these five levels as accurately as possible, given a small amount of labeled reviews and a large amount of unlabeled reviews.

*Steps performed for building the sentiment classifier: <br>
1.Pre-processing and cleaning the labeled, unlabeled and test data.<br>
2.Creating features using Count Vectorizer. <br>
3.Applying Logistic Regression to the labeled data. <br>
4.Retraining the model on the labeled data plus part of the unlabeled data with its predicted labels. <br>
5.Predicting labels for the test data.*

### Importing Packages

In [1]:
# importing packages.
import itertools
import pandas as pd
import re

# import packages for pre-processing 
import nltk
from nltk.tokenize import RegexpTokenizer
import nltk.data
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer

# import packages for feature extraction and modelling
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression 
from sklearn.naive_bayes import MultinomialNB

# import packages for pipelining
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline


### Loading the data

In [1]:
label_data = pd.read_csv("labeled_data.csv",encoding='utf-8')        # reading label data
unlabeled_data = pd.read_csv("unlabeled_data.csv",encoding='utf-8')  # reading unlabeled data
test_data = pd.read_csv("test_data.csv",encoding='utf-8')            # reading test_data data


## 2. Pre-processing data

Pre-processing and cleaning are a vital part of text analysis: raw text contains many characters (special characters, escape sequences, stray whitespace) that carry no useful signal. <br> 
The function below:
1. Removes characters that cannot be encoded as ASCII, retaining the rest of the text, and lower-cases the result.
2. Replaces literal `\n` and `\r` escape sequences with a space.
3. Replaces runs of special characters and extra spaces with a single space.
4. Stems each word, i.e. trims it to its stem so that different inflections of a word map to the same token.


In [6]:
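# function for cleaning the text of one row.
def clean_text1(data):
    
    # drop characters that cannot be encoded as ASCII, then lower-case
    clean = data['text'].encode('ascii', 'ignore').decode('utf-8')
    clean = clean.lower()

    # replace literal "\r" and "\n" escape sequences with a space
    processed_feature = re.sub(r'\\[rn]', ' ', clean)
    # replace every run of non-alphanumeric characters with a single space
    processed_feature = re.sub(r'[^a-zA-Z0-9]+', ' ', processed_feature)
    # split on whitespace; split() without arguments skips empty strings
    word_list = processed_feature.split()
    
    # stemming the data
    stemmer = PorterStemmer()
    stemmed_list = []
    for word in word_list:
        stemmed_list.append(stemmer.stem(word))
    
    text = " ".join(stemmed_list)
   
    return text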
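
To make the regex steps concrete, here is a minimal sketch that applies steps 1 to 3 to a single made-up review. It deliberately leaves out stemming so that it depends only on the standard library; the helper name `strip_and_normalise` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re


def strip_and_normalise(raw):
    """Illustrative helper (hypothetical name): cleaning steps 1-3 without stemming."""
    # Step 1: drop characters that cannot be encoded as ASCII, then lower-case.
    text = raw.encode("ascii", "ignore").decode("ascii").lower()
    # Step 2: replace literal backslash-n / backslash-r sequences with a space.
    text = re.sub(r"\\[rn]", " ", text)
    # Step 3: replace every run of non-alphanumeric characters with one space.
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)
    # Trim leading/trailing whitespace left over from the substitutions.
    return " ".join(text.split())


# Made-up review containing a non-ASCII letter and a literal "\n" sequence.
sample = "Caf\u00e9 was GREAT!!\\nWould visit again :)"
cleaned = strip_and_normalise(sample)
print(cleaned)  # caf was great would visit again
```

Note that the accented letter is dropped entirely rather than transliterated, so "Café" becomes "caf".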

In [7]:
# pre-processing the labeled data
label_data['clean_text'] = label_data.apply(clean_text1,axis=1)
label_data['ID'] = "Labeled"
label_data['test_id'] = ""      # placeholder column for test id.

# pre-processing the unlabeled data
unlabeled_data['clean_text'] = unlabeled_data.apply(clean_text1,axis=1)
unlabeled_data['ID'] = "Unlabeled"
unlabeled_data['label'] = ""    # placeholder column, filled with predictions later
unlabeled_data['test_id'] = ""  # placeholder column for test id

# pre-processing the test data
test_data['clean_text'] = test_data.apply(clean_text1,axis=1)
test_data['ID'] = "Test"
test_data['label'] = ""         # placeholder column, filled with predictions later

## 3. Model Building

For model building, we first need to prepare the data in the form the algorithms expect.
The following steps are performed for preparing the data and building the model:
1. The clean text is converted into features using Count Vectorizer.
2. Stop words are removed, only tokens of 3 or more characters are kept, and uni-, bi- and trigrams are formed.
3. Logistic Regression is fitted on the labeled data and then used to predict labels for the unlabeled data.

In semi-supervised learning, we typically have a small labeled set and a large unlabeled set. <br> **Hence, after predicting labels for the unlabeled data, we add 100,000 (1 lakh) rows of unlabeled data, together with their predicted labels, to the labeled data and train the model again. <br>
The labeled data contains 50k rows, so we took 100,000 pseudo-labeled rows to give the model more data to train on.**

The retrained model is then applied to the given test data.
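
The retraining procedure described above can be sketched on toy data as follows. This is a minimal illustration of the data flow only (fit, pseudo-label, combine, refit), not the notebook's model: `WordVoteClassifier` is a hypothetical stand-in with the same `fit`/`predict` interface as the scikit-learn pipeline, and `self_train`, `n_pseudo` and the toy reviews are assumptions made up for the example.

```python
from collections import Counter, defaultdict


class WordVoteClassifier:
    """Hypothetical stand-in model: each known word votes for the label
    it was most often seen with during fit()."""

    def fit(self, texts, labels):
        counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            for word in text.split():
                counts[word][label] += 1
        self.word_label_ = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        # Fallback for texts with no known words: the most frequent label.
        self.default_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        preds = []
        for text in texts:
            votes = Counter(self.word_label_[w] for w in text.split()
                            if w in self.word_label_)
            preds.append(votes.most_common(1)[0][0] if votes else self.default_)
        return preds


def self_train(model, X_lab, y_lab, X_unlab, n_pseudo):
    """Fit on labeled data, pseudo-label the unlabeled data, then refit on
    the labeled data plus the first n_pseudo pseudo-labeled rows."""
    model.fit(X_lab, y_lab)
    pseudo = model.predict(X_unlab)
    X_combined = list(X_lab) + list(X_unlab[:n_pseudo])
    y_combined = list(y_lab) + list(pseudo[:n_pseudo])
    model.fit(X_combined, y_combined)
    return model, pseudo[:n_pseudo]


# Toy data: labels 5 (strong positive) and 1 (strong negative).
X_lab = ["great food", "lovely staff", "terrible service"]
y_lab = [5, 5, 1]
X_unlab = ["great menu", "terrible wait", "okay place"]

# Before retraining, "wait" is unknown, so the model falls back to the default.
before = WordVoteClassifier().fit(X_lab, y_lab).predict(["long wait"])  # -> [5]

model, pseudo_labels = self_train(WordVoteClassifier(), X_lab, y_lab,
                                  X_unlab, n_pseudo=2)
after = model.predict(["long wait"])  # -> [1]
```

The example also shows the main risk of this approach: pseudo-labels are taken at face value, so any mistakes the first model makes on the unlabeled rows are fed back into the second fit.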

In [26]:
# creating train and test for model building.
X_train = label_data.clean_text
y_train = label_data.label

X_unlabel = unlabeled_data.clean_text
X_test = test_data.clean_text


###  Logistic Regression

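- We combine feature generation (Count Vectorizer) and Logistic Regression in a single pipeline, so that the same transformation is applied consistently whenever the model is fitted or used for prediction.
- Steps 1 and 2 are performed through the parameters passed to the Count Vectorizer.
- `n_jobs=4` allows up to 4 CPU cores to be used (in scikit-learn this mainly matters for one-vs-rest fitting); `C` is the inverse of the regularisation strength, so a smaller value such as 0.5 penalises large coefficients more strongly, which helps with the large number of n-gram features; the `'lbfgs'` solver supports the multinomial loss and suits multiclass problems on larger datasets.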
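
To see what the vectorizer settings above do to a single review, the following sketch approximates them with the standard library only. It is not scikit-learn's implementation: `word_ngrams` is a hypothetical helper, and `STOP_WORDS` is a tiny made-up subset rather than the full English stop-word list. It follows the same order of operations as a word-level Count Vectorizer: tokenise, remove stop words, then build n-grams.

```python
import re

# Tiny illustrative subset; scikit-learn's built-in English list is much longer.
STOP_WORDS = {"the", "was", "and"}


def word_ngrams(text, ngram_range=(1, 3), token_pattern=r"\w{3,}"):
    """Approximate word-level n-gram extraction (hypothetical helper)."""
    # Tokenise: keep only runs of 3 or more word characters, lower-cased.
    tokens = re.findall(token_pattern, text.lower())
    # Remove stop words before n-grams are formed.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


features = word_ngrams("The crust was so crispy and fresh")
print(features)
# ['crust', 'crispy', 'fresh', 'crust crispy', 'crispy fresh', 'crust crispy fresh']
```

Because short tokens and stop words are removed first, the bigram "crust crispy" joins two words that were not adjacent in the original sentence.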

In [22]:
# building the model through a pipeline.
logreg = Pipeline([
                
                # converts clean text into bag of words.
                ('cntvector', CountVectorizer(stop_words='english', # removes stop words from the text.
                                    token_pattern=r'\w{3,}',        # accept tokens that have 3 or more characters
                                    analyzer='word',
                                    ngram_range=(1, 3))),           # forms uni, bi, and trigrams
    
                # applying logistic regression
                ('clf', LogisticRegression(n_jobs=4, C=0.5, solver='lbfgs',multi_class='multinomial')),
               ])

# Fit the model to the data.
logreg.fit(X_train, y_train)

# predicting on the unlabeled data.
y_unlabel = logreg.predict(X_unlabel)

### Combining the predicted data with the labeled data.

In [None]:
new_label_data = unlabeled_data.copy()       # copy so the original frame is left untouched.
new_label_data['label'] = y_unlabel          # attaching the predicted labels.

# concat the first 100,000 pseudo-labeled rows with label_data.
combine = pd.concat([label_data, new_label_data[:100000]], ignore_index=True)

# preparing x,y for retraining the model.
X_combine = combine.clean_text
y_combine = combine.label

In [None]:
# Fit the model to the new data.
logreg.fit(X_combine, y_combine)

# predicting on the given test data.
y_test = logreg.predict(X_test)

### Preparing data for evaluation.

In [23]:
eval_data = test_data.copy()
eval_data['label'] = y_test     # adding the prediction column to the test data

final = eval_data[['test_id','label']]    # keeping only the columns needed for the output file.

#### Exporting the data

In [25]:
final.to_csv("predict_label.csv",index=False)