In this notebook we apply logistic regression to a labelled reviews dataset to predict whether a review expresses positive or negative sentiment.


In [1]:
# Import the libraries used in this notebook
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


## About Dataset

The data comes from the Sentiment Labelled Sentences Data Set in the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

This data set includes labeled reviews from IMDb, Amazon, and Yelp. Each review is marked with a score of 0 for a negative sentiment or 1 for a positive sentiment.
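The loading code in this notebook reads each file as one review per line, with the sentence and its label separated by a tab. As a minimal sketch of how such lines parse with pandas, the snippet below reads a small in-memory sample instead of the real files. The third sentence is made up for illustration, and `quoting=csv.QUOTE_NONE` is an assumption that is not part of this notebook: it keeps pandas from treating a double quote inside a sentence as the start of a quoted field, which could otherwise merge several lines into one row.

```python
import csv
import io

import pandas as pd

# Two lines in the style of the Yelp file plus one hypothetical sentence
# containing double quotes (made up for this sketch).
sample = (
    "Wow... Loved this place.\t1\n"
    "Crust is not good.\t0\n"
    'The "best" film ever.\t1\n'
)

# quoting=csv.QUOTE_NONE is an assumption, not used elsewhere in this notebook:
# it makes pandas keep double quotes as ordinary characters.
sample_df = pd.read_csv(
    io.StringIO(sample),
    names=["sentence", "label"],
    sep="\t",
    quoting=csv.QUOTE_NONE,
)

print(sample_df)
```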


## Reading Dataset

In [26]:
filepath_dict = {'yelp':   r'F:\Machine Learning\NLP\Sentimental analysis\Dataset\\yelp_labelled.txt',
                 'amazon': r'F:\Machine Learning\NLP\Sentimental analysis\Dataset\\amazon_cells_labelled.txt',
                 'imdb':   r'F:\Machine Learning\NLP\Sentimental analysis\Dataset\\imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

reviews_dataset = pd.concat(df_list)
print("Total number of reviews in this dataset:", reviews_dataset.shape[0])
print("Total number of attributes in this dataset:", reviews_dataset.shape[1])
individual_review_counts = reviews_dataset.source.value_counts()
print(individual_review_counts)
for val, count in zip(individual_review_counts.index, individual_review_counts):
    print("Total number of reviews under {} is {}".format(val, count))
reviews_dataset.head(5)

Total number of reviews in this dataset: 2748
Total number of attributes in this dataset: 3
yelp      1000
amazon    1000
imdb       748
Name: source, dtype: int64
Total number of reviews under yelp is 1000
Total number of reviews under amazon is 1000
Total number of reviews under imdb is 748


|   | sentence | label | source |
|---|----------|-------|--------|
| 0 | Wow... Loved this place. | 1 | yelp |
| 1 | Crust is not good. | 0 | yelp |
| 2 | Not tasty and the texture was just nasty. | 0 | yelp |
| 3 | Stopped by during the late May bank holiday of... | 1 | yelp |
| 4 | The selection on the menu was great and so wer... | 1 | yelp |


Next, we define the helper functions used in the rest of this notebook.
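One of these helpers builds a bag-of-words representation with `CountVectorizer`. As a minimal sketch of what that produces, the snippet below fits a vectorizer on two made-up training sentences and transforms a made-up test sentence. The sentences and variable names are hypothetical and only serve to show the shape of the output.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences, not taken from the dataset
train_sentences = ["the food was good", "the service was bad"]
test_sentences = ["good service and good food"]

vectorizer = CountVectorizer()
vectorizer.fit(train_sentences)  # the vocabulary comes from the training data only

# List the vocabulary words in column order
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)

# Each row is a sentence and each column counts one vocabulary word.
# Words never seen during fit ("and" here) are ignored.
test_counts = vectorizer.transform(test_sentences).toarray()
print(test_counts)
```

Because the vocabulary is fixed at fit time, the test sentence is described only in terms of words that occurred in the training sentences.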

In [37]:
# def split_dataset(reviews,labels):
#     reviews_train, reviews_test, labels_train,labels_test = train_test_split(
#         reviews,labels, test_size = 0.25, random_state = 100 )
#     return reviews_train, reviews_test, labels_train,labels_test

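def generate_BOW(reviews_train, reviews_test):
    # Build the vocabulary from the training sentences only, so that no
    # information from the test split leaks into the features
    vectorizer = CountVectorizer()
    vectorizer.fit(reviews_train)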

    reviews_train = vectorizer.transform(reviews_train)
    reviews_test = vectorizer.transform(reviews_test)
    
    return reviews_train,reviews_test

def train_logistic_regression(reviews_train, labels_train):
    # Fit a logistic regression classifier on the bag-of-words features
    classifier = LogisticRegression()
    classifier.fit(reviews_train, labels_train)

    return classifier

def calculate_accuracy(classifier,reviews_test,labels_test):
    # Mean accuracy on the held-out test split
    score = classifier.score(reviews_test,labels_test)
    return score

def run(reviews,labels, source = "compllete"):
    reviews_train, reviews_test, labels_train,labels_test = train_test_split(
        reviews,labels, test_size=0.25, random_state=1000 )

    reviews_train,reviews_test = generate_BOW(reviews_train,reviews_test)
    classifier = train_logistic_regression(reviews_train,labels_train)
    score = calculate_accuracy(classifier, reviews_test,labels_test)
    print("Accuracy for {} dataset is {:.3f}".format(source,score))



In [38]:
for source in reviews_dataset['source'].unique():
    df_source = reviews_dataset[reviews_dataset['source'] == source]
    reviews = df_source['sentence'].values
    labels = df_source['label'].values
    run(reviews,labels,source = source)

Accuracy for yelp dataset is 0.796
Accuracy for amazon dataset is 0.796
Accuracy for imdb dataset is 0.749


In the cells above we trained a separate model for each source. Now we train a single model on all sources combined.

In [39]:
reviews = reviews_dataset["sentence"].values
labels = reviews_dataset["label"].values
run(reviews,labels)

Accuracy for compllete dataset is 0.820
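
The accuracy score summarises performance on the test split, but it can also help to see the classifier label individual sentences. The snippet below is a minimal sketch that chains `CountVectorizer` and `LogisticRegression` with scikit-learn's `make_pipeline` and predicts labels for new text. The training and test sentences are a made-up toy corpus, not the reviews dataset, so the output says nothing about the models trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
toy_sentences = [
    "great food", "loved it", "great service",
    "terrible food", "hated it", "terrible service",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and then fits the classifier,
# so it can be called directly on strings
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(toy_sentences, toy_labels)

new_sentences = ["great place", "terrible place"]
predictions = model.predict(new_sentences)
probabilities = model.predict_proba(new_sentences)

for sentence, label, proba in zip(new_sentences, predictions, probabilities):
    print("{!r} -> label {} (P(positive) = {:.2f})".format(sentence, label, proba[1]))
```

Wrapping both steps in one pipeline keeps the vocabulary and the classifier together, so the same object can be reused on unseen sentences without refitting the vectorizer.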
