# Table of Contents
1. [About Dataset](#about)<br>
    1.1. [Context](#context)<br>
    1.2. [Features](#features)<br>
    1.3. [Source](#data_source)<br>
2. [Project Objective](#project_objective)<br>
3. [Loading the Data](#loading_the_data)<br>
4. [Data Preparation and Processing](#data_processing)<br>
    4.1. [Format Correction](#format_correction)<br>
    4.2. [Data Standardization](#data_standardization)<br>
5. [Text Analysis](#text_analysis)<br>
    5.1. [Finding the Average Rating](#5.1)<br>
    5.2. [Exploring Data Associated with Higher-than-Average and Lower-than-Average Ratings](#5.2)<br>
    &emsp;&emsp;5.2.1. [Higher-than-Average Rating](#5.2.1)<br>
    &emsp;&emsp;5.2.2. [Lower-than-Average Rating](#5.2.2)<br>

## 1. About Dataset<a id='about'></a>

### 1.1. Context<a id='context'></a>

- The Women’s Clothing E-Commerce dataset consists of reviews written by customers.
- Reviews are free-form text from which information can be systematically extracted to improve both business processes and the user experience.
- This is real commercial data, so it has been anonymized: references to the company in the review text and body have been replaced with "retailer".

### 1.2. Features<a id='features'></a>

- __Clothing ID__: Integer categorical variable that identifies the specific piece being reviewed.
- __Age__: Positive integer variable for the reviewer's age.
- __Title__: String variable for the title of the review.
- __Review Text__: String variable for the review body.
- __Rating__: Positive ordinal integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- __Recommended IND__: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- __Positive Feedback Count__: Positive integer documenting the number of other customers who found this review positive.
- __Division Name__: Categorical name of the product's high-level division.
- __Department Name__: Categorical name of the product's department.
- __Class Name__: Categorical name of the product's class.

### 1.3. Source<a id='data_source'></a>

- <a href='https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews'>Original Link</a>
- The file was downloaded as a `csv` and loaded into a relational database system, `PostgreSQL`.

## 2. Project Objective<a id='project_objective'></a>

- To quantitatively identify keywords that correspond with __higher-than-average__ or __lower-than-average__ ratings using text analytics.

## 3. Loading the Data<a id='loading_the_data'></a>

In [1]:
%load_ext sql

In [3]:
connection_string = f"postgresql+psycopg2://{user}:{password}@{host}/{db}"

In [4]:
%sql $connection_string

In [5]:
%sql SELECT * FROM clothing_reviews LIMIT 1;

 * postgresql+psycopg2://postgres:***@localhost/lightdb
1 rows affected.


review_id,clothing_id,age,title,review,rating,recommended_ind,positive_feedback_count,division_name,department_name,class_name
1,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,True,0,Initmates,Intimate,Intimates
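
The cell that defines `user`, `password`, `host` and `db` is not shown above. A minimal sketch of one way to supply them, assuming the values come from environment variables: the names `PGUSER`, `PGPASSWORD`, `PGHOST` and `PGDATABASE` and the helper `build_connection_string` are illustrative choices, not part of the original notebook, and the defaults are taken from the connection line printed above.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Assemble a SQLAlchemy-style PostgreSQL URL from environment variables."""
    # Assumed variable names and defaults; adjust to the local setup.
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "")
    host = env.get("PGHOST", "localhost")
    db = env.get("PGDATABASE", "lightdb")
    # Percent-encode the password so characters such as '@' or '/' do not break the URL.
    return f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}/{db}"


connection_string = build_connection_string()
```

Keeping credentials out of the notebook this way also keeps the password out of saved cell contents.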


## 4. Data Preparation and Processing<a id='data_processing'></a>

### 4.1. Format Correction<a id='format_correction'></a>

The queries below were run directly in pgAdmin 4:<br>

__Fix the Format of the Double Quotes: Replace """" with " in the title and review columns__<br>
- `UPDATE clothing_reviews SET title = REPLACE(title, '""""', '"') WHERE title LIKE '%""""%';`
- `UPDATE clothing_reviews SET review = REPLACE(review, '""""', '"') WHERE review LIKE '%""""%';`
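
For illustration, the same replacement can be sketched in Python. The helper `fix_quotes` and the sample string are hypothetical and only mirror what `REPLACE(..., '""""', '"')` does to a single value.

```python
def fix_quotes(text):
    """Collapse each run of four double quotes into a single double quote."""
    if text is None:  # a NULL title stays NULL, as it would in SQL
        return None
    return text.replace('""""', '"')


sample = 'love the """"boyfriend"""" fit'  # hypothetical review text
print(fix_quotes(sample))
```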

__The original ID (review_id) starts from 0: shift the ID column to start from 1__<br>

- `UPDATE clothing_reviews SET review_id = review_id + 1000001;`
- `UPDATE clothing_reviews SET review_id = review_id - 1000000;`
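
The two statements net out to `review_id + 1`. A plausible reason for the detour, assuming `review_id` carries a primary-key or unique constraint that is checked row by row (the notebook does not say), is that a direct `+ 1` can collide with an id that has not been shifted yet. The sketch below simulates that behaviour on a hypothetical list of ids; `apply_update` is an illustrative helper, not a database call.

```python
def apply_update(ids, shift):
    """Apply id = id + shift one row at a time, checking uniqueness after every row."""
    current = list(ids)
    for i in range(len(current)):
        new_id = current[i] + shift
        others = current[:i] + current[i + 1:]
        if new_id in others:
            raise ValueError(f"duplicate key: {new_id}")
        current[i] = new_id
    return current


ids = [0, 1, 2, 3]  # hypothetical sample of the original ids

try:
    apply_update(ids, 1)  # 0 -> 1 collides with the row that still holds 1
except ValueError as err:
    print("one-step shift failed:", err)

# Jump far above every existing id first, then come back down.
shifted = apply_update(apply_update(ids, 1000001), -1000000)
print(shifted)
```

The detour only works when the offset is larger than the largest existing id, so that the first step cannot land on an occupied value.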

### 4.2. Data Standardization<a id='data_standardization'></a>

- Remove stop words (e.g. the, so, ...) and punctuation
- Normalize capitalization to lowercase
- Reduce inflected forms and tenses so that tokens become word stems
- Store the result in a temp table

In [6]:
%%sql
CREATE TEMP TABLE clothing_reviews_std AS (
    WITH cte_reviews AS (
        SELECT
            (
                TS_LEXIZE(
                    'english_stem',
                    UNNEST(
                        STRING_TO_ARRAY(
                            REGEXP_REPLACE(review, '[^a-zA-Z]+', ' ', 'g'), 
                            ' '
                        )
                    )
                )
            )[1] AS token,
            rating
        FROM
            clothing_reviews
    )
    SELECT
        *
    FROM
        cte_reviews
    WHERE
        token IS NOT NULL
);

 * postgresql+psycopg2://postgres:***@localhost/lightdb
654428 rows affected.


[]
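
The standardization steps above can be mirrored on a single review in Python. This is a rough sketch under stated assumptions: `STOP_WORDS` is a tiny illustrative subset and `toy_stem` is a naive suffix stripper standing in for the `english_stem` dictionary, so its stems will not match PostgreSQL's output in general.

```python
import re

STOP_WORDS = {"the", "so", "and", "is", "it", "this", "a", "i"}  # illustrative subset only
SUFFIXES = ("ing", "ed", "ly")  # toy suffix list, NOT the Snowball algorithm


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(review):
    """Return lowercase, stop-word-free, stemmed tokens for one review."""
    # Same idea as REGEXP_REPLACE(review, '[^a-zA-Z]+', ' ', 'g'): keep letters only.
    cleaned = re.sub(r"[^a-zA-Z]+", " ", review)
    tokens = []
    for raw in cleaned.split(" "):
        word = raw.lower()
        if not word or word in STOP_WORDS:
            continue  # plays the role of the "token IS NOT NULL" filter
        tokens.append(toy_stem(word))
    return tokens


print(tokenize("Absolutely wonderful - silky and sexy and comfortable"))
```

In the SQL version each token is paired with the rating of its review, which is why the temp table has one row per token occurrence rather than one row per review.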

## 5. Text Analysis<a id='text_analysis'></a>

### 5.1. Finding the Average Rating<a id='5.1'></a>

- Find the average rating associated with each token
- Filter out the noise by keeping only tokens that occur more than 5 times
- Store the data in a temp table

In [13]:
%%sql
CREATE TEMP TABLE clothing_reviews_tokens AS (
    SELECT
        token,
        AVG(rating) AS avg_rating
    FROM
        clothing_reviews_std
    GROUP BY
        token
    HAVING
        COUNT(1) > 5
    ORDER BY
        avg_rating DESC
);

 * postgresql+psycopg2://postgres:***@localhost/lightdb
3099 rows affected.


[]
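
The aggregation in this query can be sketched in plain Python. The function name `average_rating_per_token` and the sample rows are hypothetical; the logic follows the `GROUP BY token ... HAVING COUNT(1) > 5 ... ORDER BY avg_rating DESC` pattern above.

```python
from collections import defaultdict


def average_rating_per_token(rows, min_count=5):
    """rows: iterable of (token, rating). Keep tokens seen more than min_count times."""
    totals = defaultdict(lambda: [0, 0])  # token -> [sum of ratings, occurrences]
    for token, rating in rows:
        totals[token][0] += rating
        totals[token][1] += 1
    averages = {
        token: total / count
        for token, (total, count) in totals.items()
        if count > min_count  # same role as HAVING COUNT(1) > 5
    }
    # Highest average first, like ORDER BY avg_rating DESC.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (token, rating) rows: "rare" appears only 3 times and is dropped.
rows = [("soft", 5)] * 4 + [("soft", 4)] * 2 + [("refund", 2)] * 6 + [("rare", 5)] * 3
result = average_rating_per_token(rows)
print(result)
```

Because there is one row per token occurrence, a word repeated within a single review contributes that review's rating once per repetition.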

In [10]:
%%sql
SELECT
    *
FROM
    clothing_reviews_tokens
ORDER BY
    avg_rating DESC
LIMIT
    10;

 * postgresql+psycopg2://postgres:***@localhost/lightdb
10 rows affected.


token,avg_rating
suitcas,5.0
lipstick,5.0
airplan,5.0
punch,5.0
impact,5.0
decemb,5.0
express,5.0
epitom,5.0
apprehens,5.0
anniversari,5.0


- In the table `clothing_reviews_tokens`, the text of the clothing reviews has been tokenized into word stems, and each stem is paired with its average rating.

### 5.2. Exploring Data Associated with Higher-than-Average and Lower-than-Average Ratings<a id='5.2'></a>

#### 5.2.1. Higher-than-Average Rating<a id='5.2.1'></a>

##### Examining the Tokens with Higher-than-Average Rating

- 'superior', 'timeless' and 'discount' are chosen for exploration
- __Assumption__: these words are associated with good ratings because they reflect the quality, design, and promotions that customers want

In [11]:
%%sql
SELECT
    *
FROM
    clothing_reviews_tokens
WHERE
    token IN ('superior', 'timeless', 'discount')
ORDER BY
    avg_rating DESC, token;

 * postgresql+psycopg2://postgres:***@localhost/lightdb
3 rows affected.


token,avg_rating
superior,4.888888888888888
timeless,4.826086956521739
discount,3.75


##### Assumptions Verification

The assumptions can be verified by running the following queries and reading through the matching review text:
- `SELECT * FROM clothing_reviews WHERE review ILIKE '%superior%';`
- `SELECT * FROM clothing_reviews WHERE review ILIKE '%timeless%';`
- `SELECT * FROM clothing_reviews WHERE review ILIKE '%discount%';`
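
As a rough illustration of what these `ILIKE '%keyword%'` filters match, here is a case-insensitive substring check in Python. The helper `ilike_contains` is hypothetical and ignores SQL wildcard characters inside the keyword; the sample strings are adapted from review excerpts in this notebook.

```python
def ilike_contains(review, keyword):
    """Case-insensitive substring test, similar in spirit to review ILIKE '%keyword%'."""
    return keyword.lower() in review.lower()


reviews = [
    "great buy especially with the discounts",
    "even at the Discounted sale price, awful",
    "the quality is superior",
]
matches = [r for r in reviews if ilike_contains(r, "discount")]
print(matches)
```

A substring match also picks up any longer word that contains the keyword, so the rows returned by these queries are a close but not exact counterpart of the stemmed tokens counted earlier.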

`superior`<br>
__Average rating__: 4.89<br>
__Subject__: Quality

- __ID__: 735
- __Rating__: 4
- The fabric and detailing of this dress is of __superior quality__, but unfortunately it runs huge-- you definitely need to wear a tank or cami underneath. i am 5'9 145lbs with massive shoulders/smaller bust and i got the xs petite!
<br><br>
- __ID__: 2007
- __Rating__: 5
- Received this dress yesterday, love it! my initial hesitation in ordering this dress, even though it is my fav color....was rather bright for sept.. but loved it and decided to order and wear it next spring. delighted to open the package and see that the dress is much darker than in the pics...more of a jade green. it is still a summer fabric/style, however the darker color will be perfect for the warmer fall days ahead. this dress is beautifully made, __superior workmanship__, and the fabric
<br><br>
- __ID__: 14251
- __Rating__: 5
- This cape is stunning! the perfect addition to any closet. it can be worn to a formal or out to dinner with jeans. the fur is so soft and the __quality is superior__. it has real fur hooks (four) to close it and the side pockets are adorable. definitely a buy and a bargain on sale!


`timeless`<br>
__Average rating__: 4.83<br>
__Subject__: Style and Design

- __ID__: 592
- __Rating__: 4
- This is a __timeless top__. i loved the overall look of the top and after eyeing it for several days decided to purchase it. warning, the shoulder area is cut small due to the fabric at the top. i am usually a size 12 (because of ddd chest) but i bumped up to a 14 and it was perfect.
<br><br>
- __ID__: 1049
- __Rating__: 5
- I love the style and quality of this blouse. it can easily be dressed up or down. the blouse is completely see through and delicate. still, it is so __romantic, feminine, distinctive and timeless__.
<br><br>
- __ID__: 6143
- __Rating__: 5
- So this is actually a holding horses dress, its mislabeled on the website. tts, a stunning vintage inspired piece, __very 40's but timeless__. the keyhole isn't too low cut, this is appropriate for work. beautiful drape and great lining - this dress can twirl! i am usually a xsmall/small and went with the small, fits perfectly though a bit loose in the waist but still very flattering - the xsmall would have been too tight in the shoulders. lovely dress

`discount`<br>
__Average rating__: 3.75<br>
__Subject__: Sales

- __ID__: 141
- __Rating__: 5
- Perfect for work or going out. i layered this with the reversible tank in medium pink so it would be work appropriate. it did not feel scratchy to me, maybe because i layered it. __great buy especially with the discounts__. feel like i lucked out.
<br><br>
- __ID__: 2970
- __Rating__: 5
- I __ordered this on a whim during the 40% discount upon sale price__. i was a bit worried it looked like a burlap sack initially, but after trying it on with an under tank and leggings i decided it was a keeper. its quite soft and is flattering. i usually wear a medium, got the m/l and fits well.
<br><br>
- __ID__: 
- __Rating__: 4
- This shirt is comfy, fits well, the color (pink or dusty rose) works well with most outfits, but the front hook is ill-placed and often visible. it could be tucked under better so as not to show, but it isn't and does show. i feel that i frequently need to fluff it up to hide the hook. otherwise, i like it, but i __wouldn't have bought it unless it was on discount__, which it was.

#### 5.2.2. Lower-than-Average Rating<a id='5.2.2'></a>

##### Examining the Tokens with Lower-than-Average Rating

- 'halloween', 'abdomen' and 'refund' are chosen for exploration
- __Assumption__: these words are associated with poor ratings because they point to particular style or fit issues and to refund situations

In [14]:
%%sql
SELECT
    *
FROM
    clothing_reviews_tokens
WHERE
    token IN ('halloween', 'abdomen', 'refund')
ORDER BY
    avg_rating DESC, token;

 * postgresql+psycopg2://postgres:***@localhost/lightdb
3 rows affected.


token,avg_rating
abdomen,2.142857142857143
halloween,2.0
refund,2.0


##### Assumptions Verification

The assumptions can be verified by running the following queries and reading through the matching review text:
- `SELECT * FROM clothing_reviews WHERE review ILIKE '%halloween%';`
- `SELECT * FROM clothing_reviews WHERE review ILIKE '%abdomen%';`
- `SELECT * FROM clothing_reviews WHERE review ILIKE '%refund%';`

`halloween`<br>
__Average rating__: 2<br>
__Subject__: Similarity with Costume

- __ID__: 3761
- __Rating__: 1
- I tried this dress on in store. i loved it online, but in person....not so much. it looked and __felt like a cheap halloween costume__. the fabric was bad. really bad.. the mustard color was beautiful.... boo
<br><br>
- __ID__: 15420
- __Rating__: 1
- I took the advise of a reviewer and sized down to xs. i am 5'6 and 110. i could take them off without unbuttoning them, they were hugh, even the xs! they __looked like a halloween costume__. even at the discounted sale price, awful and kind of made me laugh. but i guess that's why they are still available at this sale price.
<br><br>
- __ID__: 20620
- __Rating__: 2
- I ordered a small from the online shop knowing it would be drapey, but it's much more than drapey. and the front draping __looks like halloween costume material__ - not chic or sophisticated. sadly, i will be returning this top.

`abdomen`<br>
__Average rating__: 2.14<br>
__Subject__: Styling around Abdomen Area

- __ID__: 11547
- __Rating__: 2
- I anticipated receiving this skirt as i love a longer pencil skirt and this looks beautiful. i opened it, tried it on, sent it back...it is well made but has too many negatives. first, it is very thick. i ordered size 6 (5'2", 140 lbs.). the size was good, but it is so thick that it adds __lots of extra fluff around the waist and abdomen__. it is also very stretchy and fits in an unflattering way. finally, and most importantly, it has a back and front vent and both are vented all the way up to the c
<br><br>
- __ID__: 18757
- __Rating__: 1
- This dress comes with a built in fupa. literally there is __extra material at the lower abdomen that sticks out and is very unflattering__.
<br><br>
- __ID__: 22886
- __Rating__: 1
- Bummer, i loved the lacy bell arms and everything about the look of this sweater but it is __huge in the abdomen__. seriously looks maternity to me. the neck is also too wide. the quality of the material is very nice and the weight is nice and heavy but the cut just didn't work on me. i returned this one.

`refund`<br>
__Average rating__: 2<br>
__Subject__: Refund

- __ID__: 4062
- __Rating__: 2
- I ordered this blouse because it was such a good price on sale but should have paid closer attention to the reviews. the slits on the sleeves are much more noticeable than in the photo and the blouse just didn't work for me. something about it reminded me of seinfeld's pirate shirt. i appreciate __retailer's return policy__. i took it back to the store and got an immediate refund on my cc.
<br><br>
- __ID__: 20204
- __Rating__: 2
- Unfortunately, this dress is shaped like a sack and has no shape to speak of. i have __sent it back for a refund__ :( too big, too shapeless.
<br><br>
- __ID__: 8159
- __Rating__: 2
- Would be flattering on someone who is slim with all the right curves. i'm average size and a mother of 3. the medium was not a true medium. the fabric was amazing. the arm holes were too open/low cut for my taste.  i'm 5'7" and 140. this landed mid calf. not my preference.   retailer accepted the return with a __hassel free refund__ all under 2 weeks from receiving it.