## Global variables

In [1]:
# Global variables
FILENAME = "Womens Clothing E-Commerce Reviews.csv"
ASPECTS = ["dress", "love", "fit", "size", "top", "color", "look", "wear", "fabric", "cute", "flattering", "comfortable"]

## Define the Class AspectSentimentAnalyzer

In [2]:
import pandas as pd
from nltk import sent_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from preprocessing import preprocessText

# Global variables
FILENAME = "../data/Womens Clothing E-Commerce Reviews.csv"
ASPECTS = ["dress", "love", "fit", "size", "top", "color", "look", "wear", "fabric", "cute", "flattering", "comfortable"]

class AspectSentimentAnalyzer:
    def __init__(self):
        # Load and preprocess the data
        self.df = pd.read_csv(FILENAME)
        # Drop the first column; `columns=` already implies axis=1
        self.df.drop(columns=self.df.columns[0], inplace=True)
        self.df['Combined_Text'] = self.df['Review Text'].fillna('') + ' ' + self.df['Title'].fillna('')
        self.df['Weight'] = self.df['Positive Feedback Count'].fillna(0) + 1
        self.sia = SentimentIntensityAnalyzer()
        self._prepare_sentiments()

    def _find_aspect_sentences(self, text, aspect):
        return [sentence for sentence in sent_tokenize(text) if aspect in sentence]

    def _analyze_aspect_sentiment(self, aspect_sentences):
        if not aspect_sentences:
            return 0
        scores = [self.sia.polarity_scores(sentence)['compound'] for sentence in aspect_sentences]
        return sum(scores) / len(scores)

    def _prepare_sentiments(self):
        for aspect in ASPECTS:
            self.df[aspect + '_sentences'] = self.df['Combined_Text'].apply(lambda text: self._find_aspect_sentences(text, aspect))
            self.df[aspect + '_sentiment'] = self.df[aspect + '_sentences'].apply(self._analyze_aspect_sentiment)
            self.df[aspect + '_weighted_sentiment'] = self.df[aspect + '_sentiment'] * self.df['Weight']

        self._aggregate_sentiments()

    def _aggregate_sentiments(self):
        agg_columns = {aspect + '_weighted_sentiment': 'sum' for aspect in ASPECTS}
        agg_columns['Weight'] = 'sum'
        self.df_grouped = self.df.groupby('Clothing ID').agg(agg_columns).reset_index()

        for aspect in ASPECTS:
            weighted_col = aspect + '_weighted_sentiment'
            self.df_grouped[aspect + '_sentiment'] = self.df_grouped[weighted_col] / self.df_grouped['Weight']
            self.df_grouped.drop(columns=[weighted_col], inplace=True)

    def get_ranking_by_aspect(self, input_aspects, N=10):
        def calculate_combined_score(row, aspect_cols):
            non_zero_scores = [score for score in row[aspect_cols] if score != 0]
            return sum(non_zero_scores) / len(non_zero_scores) if non_zero_scores else 0

        aspect_cols = [aspect + '_sentiment' for aspect in input_aspects]
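        # Take an explicit copy so the new column is assigned to an independent
        # DataFrame (avoids pandas' SettingWithCopyWarning)
        df_selected = self.df_grouped[['Clothing ID'] + aspect_cols].copy()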
        df_selected['combined_score'] = df_selected.apply(calculate_combined_score, axis=1, args=(aspect_cols,))
        df_ranked = df_selected.sort_values(by='combined_score', ascending=False)
        return df_ranked[['Clothing ID', 'combined_score']].head(N)
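
The per-item score produced by `_prepare_sentiments` and `_aggregate_sentiments` is a weighted mean: each review's aspect sentiment is multiplied by `Positive Feedback Count + 1`, summed per Clothing ID, and divided by the summed weights. A minimal sketch of that arithmetic in plain Python, using made-up rows (the IDs, scores, and feedback counts below are illustrative assumptions, not values from the dataset):

```python
from collections import defaultdict

# Hypothetical rows: (clothing_id, aspect_sentiment, positive_feedback_count)
reviews = [
    (1, 0.8, 0),
    (1, -0.4, 3),
    (2, 0.5, 1),
]

weighted_sum = defaultdict(float)
weight_total = defaultdict(float)
for clothing_id, sentiment, feedback_count in reviews:
    weight = feedback_count + 1  # same rule as the 'Weight' column in the class
    weighted_sum[clothing_id] += sentiment * weight
    weight_total[clothing_id] += weight

# Weighted mean sentiment per clothing item
item_sentiment = {cid: weighted_sum[cid] / weight_total[cid] for cid in weighted_sum}
print(item_sentiment)
```

For item 1 the negative review carries four times the weight of the positive one, so the item score ends up negative. Note also that in the class a review that never mentions the aspect gets a sentiment of 0 but still adds its weight to the denominator, which pulls the item's score toward zero.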

In [3]:
analyzer = AspectSentimentAnalyzer()
print(analyzer.get_ranking_by_aspect(["color", "fit", "fabric"]))

      Clothing ID  combined_score
527           527         0.98870
791           791         0.96800
580           580         0.96030
57             57         0.94640
503           503         0.92730
1161         1161         0.92495
1194         1194         0.92300
140           140         0.92090
487           487         0.91860
224           224         0.91840




# Feedback

We evaluate the results manually by checking whether the reviews for the returned Clothing IDs actually address the requested aspects.

## Test 1

For the input "color", "fit", "fabric" above, we examined the reviews associated with the returned Clothing IDs and found that the 10 results are relevant overall.

The majority of the reviews are relevant, e.g., "I love the sweatshirt. clay color is very different it's a nice light fabric with nice detailed edges although it is an oversized piece it hangs and fits well although i am petite. great light sweatshirt for spring and summer" for Clothing ID 580.

However, one result is mixed: Clothing ID 527, with the review "Cute bright colored suit a little more orange than pink i prefer pink but the fit didn't quite work for me. the bottoms were a little too big and the top although was fairly well proportioned just didn't look right altogether purchase the extra smalls in both  and have since returned". The review is positive about color but negative about fit.
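
One possible explanation for the Clothing ID 527 case is that sentiment is scored per sentence, not per aspect: when two aspects occur in the same sentence, both receive that sentence's score. The sketch below applies the class's substring test to the quoted review text. It assumes a naive regex splitter (`split_sentences`, a hypothetical stand-in for `sent_tokenize`) so that it runs without NLTK data, and it looks only at the review text quoted here, not the title or other reviews of the same item.

```python
import re

def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize (an assumption for this sketch):
    # split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def find_aspect_sentences(text, aspect):
    # Same substring test as AspectSentimentAnalyzer._find_aspect_sentences
    return [s for s in split_sentences(text) if aspect in s]

review = ("Cute bright colored suit a little more orange than pink i prefer pink "
          "but the fit didn't quite work for me. the bottoms were a little too big "
          "and the top although was fairly well proportioned just didn't look right "
          "altogether purchase the extra smalls in both  and have since returned")

color_sentences = find_aspect_sentences(review, "color")
fit_sentences = find_aspect_sentences(review, "fit")
fabric_sentences = find_aspect_sentences(review, "fabric")

print(color_sentences == fit_sentences)  # both aspects map to the same sentence
print(len(fabric_sentences))             # "fabric" is not mentioned at all
```

Under these assumptions "color" (matched inside "colored") and "fit" select the same single sentence, so any sentence-level scorer assigns them identical values, and "fabric" scores 0 and is skipped by `calculate_combined_score`.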


## Test 2


In [4]:
print(analyzer.get_ranking_by_aspect(["comfortable"]))



      Clothing ID  combined_score
791           791          0.9680
387           387          0.9509
776           776          0.9266
499           499          0.9246
309           309          0.9144
311           311          0.8977
1136         1136          0.8934
297           297          0.8807
85             85          0.8807
646           646          0.8805


For the input "comfortable", we again examined the reviews associated with the returned Clothing IDs and found that the 10 results are relevant overall, indicating high precision.

The majority of the reviews are relevant and should be returned, e.g., "These pants are very comfortable i love putting my phone and wallet in the front big pocket so cool the linen is soft fits great" for Clothing ID 791. Only one is less relevant: Clothing ID 856, with the review "You won't break the bank with this cute tee. loved that it's not a thin fabric. the variation in the stripes adds interest, too. i ordered a large, my regular size, and it fit well. i returned because it was a bit on the long side on my short frame, the colors didn't suit me and i need long sleeve tees now that we're heading into the colder months"
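
To put a number on the manual check, precision at k can be computed from per-result relevance judgments. This is a minimal sketch: `precision_at_k` is a hypothetical helper, and the labels below are illustrative placeholders, not the actual judgments from the tests above.

```python
def precision_at_k(relevance_labels, k):
    # relevance_labels: booleans in ranked order, True if the result was judged relevant
    top_k = relevance_labels[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0

# Hypothetical manual judgments for a top-10 list with one result judged not relevant
labels = [True, True, True, True, True, True, False, True, True, True]
score = precision_at_k(labels, 10)
print(score)  # 0.9
```

Recording the judgments as explicit labels also makes it easy to repeat the check for other aspect combinations and compare the results.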