
Install the VADER package

In [None]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


Import data from GitHub

In [None]:
url = 'https://raw.githubusercontent.com/Sabrina-Hendricks/DS4002-Group13/main/Data/Womens%20Clothing%20E-Commerce%20Reviews.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


Drop irrelevant columns, empties, and duplicates

In [None]:
df.rename(columns={'Unnamed: 0': 'Review ID'}, inplace=True)
df = df.drop(columns=['Title'])  # Many titles are missing and this column is not used in the analysis
df.head()

Unnamed: 0,Review ID,Clothing ID,Age,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [None]:
df = df.dropna()  # Drop rows with any missing value (this also covers rows that are entirely empty)
df = df.drop_duplicates()  # Drop exact duplicate rows
df.shape

(22628, 10)
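
The difference between dropping rows with any missing value and dropping rows that are entirely missing can be sketched on a toy frame. The data below is invented for illustration and is not taken from the review dataset; the variable names are assumptions as well.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not rows from the review dataset.
toy = pd.DataFrame({
    "Review Text": ["Love it", None, "Too small", "Too small", None],
    "Rating": [5, 4, 2, 2, np.nan],
})

after_any = toy.dropna()                   # removes every row containing a NaN
after_all = after_any.dropna(how="all")    # nothing left to remove at this point
after_dupes = after_any.drop_duplicates()  # removes the repeated "Too small" row

print(len(toy), len(after_any), len(after_all), len(after_dupes))
```

Because `dropna()` already removes any row with at least one missing value, a later `dropna(how='all')` has no rows left to act on.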

Feature Engineering

In [None]:
# Review length (number of words) - we believe this could be an interesting variable to investigate
df['Review Length'] = df['Review Text'].apply(lambda x: len(x.split()))
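
As a quick check of how the word count behaves, the sketch below compares the `apply` approach with pandas' vectorized string methods on a few made-up reviews. The sample strings and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up reviews, including irregular spacing and an empty string.
reviews = pd.Series(["Love this dress!", "Runs  small,   sadly", ""])

# Same logic as the cell above: split on whitespace and count the pieces.
by_apply = reviews.apply(lambda x: len(x.split()))

# Vectorized equivalent using the .str accessor.
by_str = reviews.str.split().str.len()

print(by_apply.tolist())
print(by_str.tolist())
```

Both forms treat runs of whitespace as a single separator. One difference worth knowing: the `.str` version returns NaN for a missing review, whereas the `apply` version raises an error, which is why it is only safe after missing values have been dropped.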

In [None]:
df['Age'].describe()

Unnamed: 0,Age
count,22628.0
mean,43.28288
std,12.328176
min,18.0
25%,34.0
50%,41.0
75%,52.0
max,99.0


In [None]:
# Age groups - this will be helpful in analyzing correlation between age and review sentiment
age_bins = [0, 24, 34, 44, 54, 64, np.inf]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']

df['Age Group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels)
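
To see where the bin boundaries fall, the sketch below runs the same bins and labels over a handful of boundary ages. The test ages are chosen for illustration; only `age_bins` and `age_labels` come from the cell above.

```python
import numpy as np
import pandas as pd

age_bins = [0, 24, 34, 44, 54, 64, np.inf]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']

# Ages picked to sit on either side of each bin edge.
ages = pd.Series([18, 24, 25, 34, 35, 64, 65, 99])

# pd.cut uses right-inclusive intervals by default: (0, 24], (24, 34], ...
groups = pd.cut(ages, bins=age_bins, labels=age_labels)

print(groups.tolist())
```

Since the lowest bin starts at 0, any age below 18 would also be labeled '18-24'. That does not matter here because the minimum age reported by `describe()` is 18.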

In [None]:
# One-hot encode the categorical columns for regression
encoded_columns = pd.get_dummies(df[['Division Name', 'Department Name', 'Class Name', 'Age Group']], drop_first=True)

# Concatenate the new one-hot encoded columns with the original DataFrame
df = pd.concat([df, encoded_columns], axis=1)
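
The effect of `drop_first=True` can be sketched on a single toy column. The department values below are illustrative; the point is that the first category (in sorted order) becomes the baseline and gets no column of its own.

```python
import pandas as pd

# Toy column for illustration only.
toy = pd.DataFrame({"Department Name": ["Tops", "Dresses", "Bottoms", "Tops"]})

full = pd.get_dummies(toy, drop_first=False)    # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

A row belonging to the dropped category ("Bottoms" here) is represented by all of its dummy columns being false. Dropping one level per variable avoids perfectly collinear predictors in a regression with an intercept.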


In [None]:
df.head()

Unnamed: 0,Review ID,Clothing ID,Age,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,...,Class Name_Skirts,Class Name_Sleep,Class Name_Sweaters,Class Name_Swim,Class Name_Trend,Age Group_25-34,Age Group_35-44,Age Group_45-54,Age Group_55-64,Age Group_65+
0,0,767,33,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates,...,False,False,False,False,False,True,False,False,False,False
1,1,1080,34,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,...,False,False,False,False,False,True,False,False,False,False
2,2,1077,60,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,...,False,False,False,False,False,False,False,False,True,False
3,3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,...,False,False,False,False,False,False,False,True,False,False
4,4,847,47,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,...,False,False,False,False,False,False,False,True,False,False


Add sentiment scores to cleaned dataframe

In [None]:
# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Apply VADER sentiment analysis
df['Sentiment Score'] = df['Review Text'].apply(lambda x: analyzer.polarity_scores(x)['compound'])
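
The compound score ranges from -1 (most negative) to +1 (most positive). If a categorical label is wanted later, one possible sketch uses the thresholds suggested in the VADER documentation (0.05 and -0.05). The helper name `label_compound` and the sample scores are hypothetical and are not part of this notebook's pipeline.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score to a coarse label (hypothetical helper)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Sample scores invented for illustration.
for s in (0.8932, 0.01, -0.4):
    print(s, label_compound(s))
```

With a DataFrame, this could be applied as `df['Sentiment Score'].apply(label_compound)`, assuming the score column has already been computed.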

In [None]:
# Reorder columns to place 'Sentiment Score' after 'Review Text'
columns = df.columns.tolist()  # Get the list of columns
columns.insert(columns.index('Review Text') + 1, columns.pop(columns.index('Sentiment Score')))  # Move 'Sentiment Score'
df = df[columns]  # Reorder the DataFrame
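
The one-line reorder works here because 'Sentiment Score' sits after 'Review Text' in the list; the target index is computed before the pop, so it would be off by one if the moved column started out earlier in the list. A small helper that avoids this ordering dependence might look like the sketch below; `move_after` is a hypothetical name, not something defined elsewhere in this notebook.

```python
def move_after(columns, name, anchor):
    """Return a new column list with `name` placed right after `anchor`."""
    cols = [c for c in columns if c != name]    # remove the column first
    cols.insert(cols.index(anchor) + 1, name)   # then locate the anchor
    return cols


# Shortened column list for illustration.
example = ["Review ID", "Review Text", "Rating", "Sentiment Score"]
print(move_after(example, "Sentiment Score", "Review Text"))
```

Usage with the DataFrame would be `df = df[move_after(df.columns.tolist(), 'Sentiment Score', 'Review Text')]`.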

In [None]:
df.head()

Unnamed: 0,Review ID,Clothing ID,Age,Review Text,Sentiment Score,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,...,Class Name_Skirts,Class Name_Sleep,Class Name_Sweaters,Class Name_Swim,Class Name_Trend,Age Group_25-34,Age Group_35-44,Age Group_45-54,Age Group_55-64,Age Group_65+
0,0,767,33,Absolutely wonderful - silky and sexy and comf...,0.8932,4,1,0,Initmates,Intimate,...,False,False,False,False,False,True,False,False,False,False
1,1,1080,34,Love this dress! it's sooo pretty. i happene...,0.9729,5,1,4,General,Dresses,...,False,False,False,False,False,True,False,False,False,False
2,2,1077,60,I had such high hopes for this dress and reall...,0.9208,3,0,0,General,Dresses,...,False,False,False,False,False,False,False,False,True,False
3,3,1049,50,"I love, love, love this jumpsuit. it's fun, fl...",0.5727,5,1,0,General Petite,Bottoms,...,False,False,False,False,False,False,False,True,False,False
4,4,847,47,This shirt is very flattering to all due to th...,0.9291,5,1,6,General,Tops,...,False,False,False,False,False,False,False,True,False,False


Export data to CSV

In [None]:
# Writes Cleaned_Data.csv to the Colab session's files. We then downloaded it and uploaded it to GitHub.
# Later scripts read the uploaded copy, so this step is not required to reproduce the analysis.
df.to_csv('Cleaned_Data.csv', index=False)
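
The role of `index=False` can be sketched with an in-memory round trip. The toy frame and variable names below are assumptions for illustration; an in-memory buffer stands in for the file so nothing is written to disk.

```python
import io

import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({"Review ID": [0, 1], "Sentiment Score": [0.8932, -0.25]})

# Without the index: the columns survive the round trip unchanged.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With the index: reading back produces an extra, unnamed first column.
buffer_idx = io.StringIO()
toy.to_csv(buffer_idx)  # index=True is the default
buffer_idx.seek(0)
with_index = pd.read_csv(buffer_idx)

print(list(restored.columns))
print(list(with_index.columns))
```

Writing the index and reading the file back is a common source of columns named 'Unnamed: 0', like the one renamed to 'Review ID' at the start of this notebook.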