## **Variables with Open Text**

To use open-text variables in a model, we vectorize them. The simplest vectorizers count how many times each word (token) appears in a document, or convert those counts into weighted frequencies. Each token becomes its own variable, so the number of dimensions grows quickly once text is vectorized.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/martinwg/ISA591/main/data/clothing_reviews.csv')
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [2]:
df['Review Text'].nunique()

22634

In [3]:
df.isna().sum()

Unnamed: 0,0
Clothing ID,0
Age,0
Title,3810
Review Text,845
Rating,0
Recommended IND,0
Positive Feedback Count,0
Division Name,14
Department Name,14
Class Name,14


In [5]:
df.dropna(inplace = True) ## drops every row that has a missing value in any column
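Because `Title` is missing far more often than `Review Text` (3810 vs. 845), dropping on all columns also discards reviews that still have text. If only the text column is needed, one possible alternative is `dropna(subset=...)`. The toy frame below uses made-up values to show the difference; it is a sketch, not what the notebook ran.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the missing-value pattern
toy = pd.DataFrame({
    "Title": [np.nan, "Nice", np.nan, "Too big"],
    "Review Text": ["great fit", "love it", np.nan, "runs large"],
})

drop_any = toy.dropna()                         # any missing value removes the row
drop_text = toy.dropna(subset=["Review Text"])  # only the text column must be present

print(len(drop_any), len(drop_text))
```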

In [6]:
df['Division Name'].value_counts()

Division Name,count
General,11664
General Petite,6778
Initmates,1220


In [15]:
## vectorizer options: CountVectorizer (bag-of-words counts), TfidfVectorizer (TF-IDF weighted frequencies)
from sklearn.feature_extraction.text import CountVectorizer

## create an instance
## drop common English stop words (I, you, with, and, or, ...) and keep only the 200 most frequent tokens
vectorizer = CountVectorizer(stop_words = "english", max_features=200)

## vectorize only the text-based variable(s)
X = vectorizer.fit_transform(df['Review Text'])

In [16]:
## X is stored as a sparse matrix (most entries are zero, so only the nonzero ones are kept)
X

<19662x200 sparse matrix of type '<class 'numpy.int64'>'
	with 261290 stored elements in Compressed Sparse Row format>
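The figures in this output give a rough sense of how sparse the matrix is. The sketch below computes the density (the share of nonzero cells) for a tiny hand-made matrix and then for the shape and stored-element count reported above; `toy_sparse` is an illustrative name.

```python
import numpy as np
from scipy import sparse

# Tiny matrix with two nonzero entries out of nine cells
toy_sparse = sparse.csr_matrix(np.array([[0, 0, 3],
                                         [0, 1, 0],
                                         [0, 0, 0]]))

# Density = stored nonzeros / total cells
toy_density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])

# Same calculation using the numbers printed for X above
reported_density = 261290 / (19662 * 200)

print(round(toy_density, 3), round(reported_density, 4))
```

Roughly 6.6% of the cells in `X` are nonzero, which is why a sparse format saves memory here.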

In [17]:
## change to array
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [18]:
## tokens (variables)
vectorizer.get_feature_names_out()

array(['10', 'absolutely', 'actually', 'area', 'arms', 'beautiful',
       'better', 'big', 'bit', 'black', 'blouse', 'blue', 'body', 'boots',
       'bought', 'boxy', 'bra', 'bust', 'buttons', 'buy', 'casual',
       'chest', 'color', 'colors', 'comfortable', 'comfy', 'compliments',
       'cut', 'cute', 'day', 'decided', 'definitely', 'design', 'did',
       'didn', 'different', 'does', 'doesn', 'don', 'dress', 'dressed',
       'dresses', 'easy', 'extra', 'fabric', 'fall', 'feel', 'feels',
       'felt', 'fine', 'fit', 'fits', 'fitted', 'flattering', 'flowy',
       'fun', 'glad', 'going', 'good', 'gorgeous', 'got', 'great',
       'green', 'happy', 'high', 'hips', 'jacket', 'jeans', 'just',
       'lace', 'large', 'lbs', 'leggings', 'length', 'light', 'like',
       'little', 'll', 'long', 'longer', 'look', 'looked', 'looking',
       'looks', 'loose', 'lot', 'love', 'loved', 'lovely', 'low', 'make',
       'makes', 'material', 'medium', 'model', 'navy', 'neck', 'need',
       'nic

In [19]:
## token counts as a DataFrame, one column per token (candidate predictors for PCA)
pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())

,10,absolutely,actually,area,arms,beautiful,better,big,bit,black,...,weight,went,white,wide,wish,wore,work,worn,worth,xs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19657,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19658,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19659,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19660,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
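The token columns are candidate inputs for PCA. The notebook does not show that step, so the following is only a minimal sketch on a made-up corpus, assuming scikit-learn's `PCA` and two components chosen arbitrarily.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical reviews, not taken from the dataset)
docs = [
    "love this dress",
    "dress runs small",
    "love love the color",
    "color is great and the fit is great",
]

# PCA needs a dense array, so convert the sparse counts first
counts = CountVectorizer().fit_transform(docs).toarray()

# Two components is an arbitrary choice for illustration
pca = PCA(n_components=2)
scores = pca.fit_transform(counts)

print(scores.shape)
print(pca.explained_variance_ratio_)
```

For a matrix too large to densify, `TruncatedSVD` from `sklearn.decomposition` accepts sparse input directly.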


In [20]:
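## caution: pd.concat aligns rows on index labels. After dropna, df's index has gaps,
## while the token frame is numbered 0..n-1, so rows are mismatched and NaNs appear
## (note the float columns in the output below). Passing index = df.index to
## pd.DataFrame keeps each review with its own counts.
df = pd.concat([df, pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())], axis = 1)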

In [21]:
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,...,weight,went,white,wide,wish,wore,work,worn,worth,xs
2,1077.0,60.0,Some major design flaws,I had such high hopes for this dress and reall...,3.0,0.0,0.0,General,Dresses,Dresses,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1049.0,50.0,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5.0,1.0,0.0,General Petite,Bottoms,Pants,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,847.0,47.0,Flattering shirt,This shirt is very flattering to all due to th...,5.0,1.0,6.0,General,Tops,Blouses,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5,1080.0,49.0,Not for the very petite,"I love tracy reese dresses, but this one is no...",2.0,0.0,4.0,General,Dresses,Dresses,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,858.0,39.0,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5.0,1.0,1.0,General Petite,Tops,Knits,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
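The float values in this output suggest that the concatenation introduced missing values: `pd.concat` with `axis = 1` aligns on index labels, and the index of `df` has gaps after `dropna`. The sketch below reproduces the effect on a made-up frame and shows one way to keep rows aligned; the data and names (`toy`, `token_counts`) are hypothetical.

```python
import pandas as pd

# Toy frame whose index has gaps, as it would after dropna (hypothetical values)
toy = pd.DataFrame({"Review Text": ["love it", "too small", "great fit"]},
                   index=[2, 3, 5])
token_counts = [[1, 0], [0, 1], [0, 0]]  # one row of counts per review, in order

# Default RangeIndex (0, 1, 2) does not match toy's labels (2, 3, 5)
misaligned = pd.concat(
    [toy, pd.DataFrame(token_counts, columns=["love", "small"])], axis=1)

# Reusing toy's index keeps each review next to its own counts
aligned = pd.concat(
    [toy, pd.DataFrame(token_counts, columns=["love", "small"], index=toy.index)],
    axis=1)

print(misaligned)
print(aligned)
```

In the misaligned result the review labeled 2 receives the counts of the third review, and extra rows filled with NaN appear; the aligned version has neither problem.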
