# Dataset Subsampling and Visualization
- Conduct exploratory visualization on the pre-processed data
- Subsample the dataset so that positive, negative, and neutral reviews are equally represented
- Visualize the changes made to create the subsampled balanced dataset

In [1]:
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

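RANDOM_STATE = 42  # fixed seed so the subsampling is reproducible
# The source file is read with the Windows-1252 (cp1252) encoding
df = pd.read_csv("dataset.csv", encoding='cp1252')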

- 48,622 positive reviews (label 1)
- 16,384 neutral reviews (label 0)
- 3,434 negative reviews (label -1)

In [2]:
df.groupby('label').count()

label,bookID,title,author,rating,ratingsCount,reviewsCount,reviewerName,reviewerRatings,review
-1,3434,3426,3431,3434,3434,3434,3403,3434,3434
0,16384,16354,16348,16384,16384,16384,16197,16384,16384
1,48622,48498,48464,48622,48622,48622,48185,48622,48622


## Visualize
The pie chart shows that positive reviews (label 1) make up 71.04% of the data.
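The 71.04% figure can be reproduced by hand from the per-label counts in the table above. A minimal sketch in plain Python, with the counts copied from the `review` column of that output (the names `label_counts` and `label_percentages` are illustrative and not part of the notebook):

```python
# Review counts per label, copied from the groupby output above
# (-1 = negative, 0 = neutral, 1 = positive).
label_counts = {-1: 3434, 0: 16384, 1: 48622}

total = sum(label_counts.values())

# Share of each label as a percentage of all reviews.
label_percentages = {
    label: 100 * count / total for label, count in label_counts.items()
}

for label, pct in label_percentages.items():
    print(f"label {label:>2}: {pct:.2f}%")
```

Rounded to two decimals, this gives about 5.02% negative, 23.94% neutral, and 71.04% positive reviews.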

In [3]:
fig, ax = plt.subplots()
ax.set_title('Percentage of Reviews in Each Category')
df.groupby('label')['review'].count().plot(kind='pie', autopct='%.2f', ax=ax)
plt.show()

## Subsample the data
- Idea: each label needs to be represented equally
- Create separate dataframes for the positive, negative, and neutral reviews
- To balance the dataset, randomly select 3300 rows from each dataframe (just under the 3434 negative reviews available)
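The same idea can also be expressed in a single call with `groupby(...).sample(...)`, available in pandas 1.1 and later. The sketch below is an alternative shown on hypothetical toy data, not the notebook's own method; `toy`, `balanced`, and `N_PER_LABEL` are made-up names, and the rows it draws are not guaranteed to match those picked by per-dataframe sampling.

```python
import pandas as pd

RANDOM_STATE = 42
N_PER_LABEL = 5  # hypothetical small value for the toy data; the notebook uses 3300

# Toy, imbalanced stand-in for the real dataset (hypothetical data).
toy = pd.DataFrame({
    "label": [1] * 20 + [0] * 10 + [-1] * 6,
    "review": [f"review {i}" for i in range(36)],
})

# Draw the same number of rows from every label group in one call.
balanced = toy.groupby("label").sample(n=N_PER_LABEL, random_state=RANDOM_STATE)

# Each label should now appear exactly N_PER_LABEL times.
print(balanced["label"].value_counts().sort_index())
```

Sampling without replacement requires every group to hold at least `N_PER_LABEL` rows, which is why the sample size has to stay at or below the size of the smallest class.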

In [4]:
df_pos = df[df['label']==1]
df_neutral = df[df['label']==0]
df_neg = df[df['label']==-1]

df_neg_sampled = df_neg.sample(3300, random_state=RANDOM_STATE)
df_pos_sampled = df_pos.sample(3300, random_state=RANDOM_STATE)
df_neutral_sampled = df_neutral.sample(3300, random_state=RANDOM_STATE)

Combine the three samples into one dataframe in which each label is represented equally.

In [5]:
frames = [df_neg_sampled, df_pos_sampled, df_neutral_sampled]
result = pd.concat(frames)
result.groupby('label').count()

label,bookID,title,author,rating,ratingsCount,reviewsCount,reviewerName,reviewerRatings,review
-1,3300,3292,3297,3300,3300,3300,3271,3300,3300
0,3300,3298,3297,3300,3300,3300,3259,3300,3300
1,3300,3292,3290,3300,3300,3300,3275,3300,3300


Write the balanced data to a CSV file.

In [6]:
result.to_csv('balanced_dataset.csv', encoding='utf-8', index=False)

## Visualize the subsampled data
The pie chart now shows that each label makes up one third (33.33%) of the data.

In [7]:
df_balanced = pd.read_csv("balanced_dataset.csv")
fig, ax = plt.subplots()
ax.set_title('Percentage of Reviews in Each Category After Subsampling')
df_balanced.groupby('label')['review'].count().plot(kind='pie', autopct='%.2f', ax=ax)
plt.show()
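Besides reading the chart, the balance can be checked numerically. A minimal sketch, assuming a dataframe with a `label` column as in this notebook; the helper `is_balanced` and the toy frames are hypothetical and only illustrate the check:

```python
import pandas as pd


def is_balanced(frame: pd.DataFrame, label_col: str = "label") -> bool:
    """Return True when every label occurs the same number of times."""
    counts = frame[label_col].value_counts()
    # All per-label counts are equal when there is exactly one distinct count.
    return counts.nunique() == 1


# Hypothetical toy frames: one balanced, one skewed towards label 1.
toy_balanced = pd.DataFrame({"label": [-1, 0, 1] * 4})
toy_skewed = pd.DataFrame({"label": [1] * 8 + [0] * 3 + [-1]})

print(is_balanced(toy_balanced), is_balanced(toy_skewed))
```

Applied to the notebook's data, a call such as `is_balanced(df_balanced)` would be expected to return `True`, since each label was sampled 3300 times.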