Author: Tyler Chase

Date: 2017/05/23

# Determine Dataset Statistics

Because of reading errors and differences in the number of posts per month across subreddits, the number of examples per label varies, so it is important to examine the label statistics of our dataset.

## Key Statistics to Consider

* Total number of images
* Number of images in each subreddit
* Total number of NSFW vs. SFW images
* Number of NSFW vs. SFW images within each subreddit

## Load Data

In [1]:
import tensorflow as tf
import numpy as np
import math
import timeit
import random
import pickle
import matplotlib.pyplot as plt
import itertools
from sklearn.metrics import confusion_matrix
%matplotlib inline

In [2]:
# Location and names of the pickled label files for the full data set
address = r'/Users/tylerchase/Documents/Stanford_Classes/CS231n_CNN_for_Visual_Recognition/final_project/CS-231N-Final-Project/data/fullData//'
file_names = {}
file_names['nsfw'] = 'full_nsfwlabels'
file_names['subs'] = 'full_subredditlabels'
file_names['dict'] = 'full_subredditIndex'

# Open Label Files
with open(address + file_names['subs'], 'rb') as file_1:
    subs = pickle.load(file_1)
    subs = np.array(subs)
with open(address + file_names['dict'], 'rb') as file_2:
    dictionary = pickle.load(file_2)
with open(address + file_names['nsfw'], 'rb') as file_3:
    nsfw = pickle.load(file_3)
    nsfw = np.array(nsfw)

# Print the sizes as a sanity check
print('Subreddit Labels Shape: ', subs.shape)
print('NSFW Labels shape: ', nsfw.shape)

Subreddit Labels Shape:  (31813,)
NSFW Labels shape:  (31813,)
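
Matching shapes are a necessary check, but they do not confirm that the label values themselves are valid. The following is a minimal sketch of a stricter check, assuming subreddit labels are integer indices in `[0, num_subs)` and NSFW labels are 0 or 1. The helper `check_labels` and the toy arrays are hypothetical and stand in for the pickled data.

```python
import numpy as np

def check_labels(subs, nsfw, num_subs):
    """Raise ValueError if the two label arrays are inconsistent."""
    subs = np.asarray(subs)
    nsfw = np.asarray(nsfw)
    if subs.shape != nsfw.shape:
        raise ValueError('label arrays differ in shape')
    if subs.size and (subs.min() < 0 or subs.max() >= num_subs):
        raise ValueError('subreddit index out of range')
    if not np.isin(nsfw, [0, 1]).all():
        raise ValueError('NSFW labels must be 0 or 1')
    return True

# Toy data standing in for the pickled labels (hypothetical values)
toy_subs = np.array([0, 1, 2, 1, 0])
toy_nsfw = np.array([0, 0, 1, 0, 1])
check_ok = check_labels(toy_subs, toy_nsfw, num_subs=3)
print(check_ok)
```

With the real data, the call would presumably be `check_labels(subs, nsfw, len(dictionary))`.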


## Determine Subreddit Statistics

In [3]:
# Print and store subreddits and subreddit totals
num_subs = len(dictionary)
classes = [""] * num_subs
stats = [0] * num_subs

# Form Array of Subreddits
for sub, ind in dictionary.items():
    classes[ind] = sub

# Form array of Subreddit statistics and print
for i, j in enumerate(classes):
    temp = np.sum(i == subs)
    stats[i] = temp
    print(j + ' Submissions: ', temp)
print('Sanity Check Sum: ', np.sum(stats))

# Print total submissions
total = np.shape(subs)[0]
print('\nTotal Submissions: ', total)

EarthPorn Submissions:  1707
SkyPorn Submissions:  1702
spaceporn Submissions:  1642
MilitaryPorn Submissions:  1677
GunPorn Submissions:  1639
carporn Submissions:  1669
CityPorn Submissions:  1667
ruralporn Submissions:  1217
ArchitecturePorn Submissions:  1593
FoodPorn Submissions:  1684
MoviePosterPorn Submissions:  1701
ArtPorn Submissions:  1696
RoomPorn Submissions:  1702
creepy Submissions:  1594
gonewild Submissions:  1225
PrettyGirls Submissions:  1648
ladybonersgw Submissions:  1147
LadyBoners Submissions:  1505
cats Submissions:  1683
dogpictures Submissions:  1715
Sanity Check Sum:  31813

Total Submissions:  31813
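
The per-subreddit counts can also be obtained in a single vectorized call. This is a sketch using `np.bincount`, assuming the labels are non-negative integer indices as they are above. The toy arrays are hypothetical stand-ins for `subs` and `classes`.

```python
import numpy as np

# Toy stand-ins for `subs` and `classes` (hypothetical values)
toy_subs = np.array([0, 2, 1, 2, 2, 0])
toy_classes = ['EarthPorn', 'SkyPorn', 'spaceporn']

# minlength keeps a zero entry for any class with no submissions
toy_counts = np.bincount(toy_subs, minlength=len(toy_classes))
toy_fractions = toy_counts / toy_counts.sum()

for name, count, frac in zip(toy_classes, toy_counts, toy_fractions):
    print('{} Submissions: {} ({:.1%})'.format(name, count, frac))
```

Reporting each class as a fraction of the total makes under-represented subreddits easier to spot than raw counts alone.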


## Determine NSFW Statistics

In [4]:
dict_nsfw = {}
dict_nsfw['NSFW'] = 1
dict_nsfw['SFW'] = 0

# Store and print the NSFW and SFW totals
num_out = len(dict_nsfw)
classes_nsfw = [""] * num_out
stats_nsfw = [0] * num_out
for category, ind in dict_nsfw.items():
    classes_nsfw[ind] = category
    temp = np.sum(ind == nsfw)
    stats_nsfw[ind] = temp
    print(category + ' Submissions: ', temp)
print('Sanity Check Sum: ', np.sum(stats_nsfw))

total_nsfw = np.shape(nsfw)[0]
print('\nTotal Submissions: ', total_nsfw)

NSFW Submissions:  2708
SFW Submissions:  29105
Sanity Check Sum:  31813

Total Submissions:  31813
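
These totals show a strong imbalance between the two labels. The sketch below works through the arithmetic using the counts printed above. The "always predict SFW" baseline is an illustrative assumption added here to show why raw accuracy can be misleading on this label, not part of the original analysis.

```python
# Counts copied from the output above
nsfw_count = 2708
sfw_count = 29105
total_count = nsfw_count + sfw_count

nsfw_fraction = nsfw_count / total_count
sfw_to_nsfw = sfw_count / nsfw_count
# Accuracy of a hypothetical classifier that always predicts SFW
majority_baseline = sfw_count / total_count

print('NSFW fraction: {:.1%}'.format(nsfw_fraction))
print('SFW:NSFW ratio: {:.1f}:1'.format(sfw_to_nsfw))
print('Always-SFW baseline accuracy: {:.1%}'.format(majority_baseline))
```

Roughly one image in twelve is NSFW, so a model would need to beat the majority-class baseline by a clear margin before its NSFW accuracy says much.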


## Determine NSFW Breakdown of Subreddits

In [5]:
nsfw_breakdown = {}

# Store and print NSFW breakdown of each Subreddit
for i, j in enumerate(classes):
    nsfw_sub = {}
    # Boolean mask selecting the submissions that belong to subreddit i
    class_mask = (subs == i)
    nsfw_subset = nsfw[class_mask]
    nsfw_sub['nsfw'] = np.sum(nsfw_subset == 1)
    nsfw_sub['sfw'] = np.sum(nsfw_subset == 0)
    nsfw_breakdown[j] = nsfw_sub
    print(j, ': ', nsfw_sub['nsfw'] + nsfw_sub['sfw'])
    print('NSFW: ', nsfw_sub['nsfw'])
    print('SFW: ', nsfw_sub['sfw'])
    print()


EarthPorn :  1707
NSFW:  0
SFW:  1707

SkyPorn :  1702
NSFW:  0
SFW:  1702

spaceporn :  1642
NSFW:  0
SFW:  1642

MilitaryPorn :  1677
NSFW:  6
SFW:  1671

GunPorn :  1639
NSFW:  1
SFW:  1638

carporn :  1669
NSFW:  1
SFW:  1668

CityPorn :  1667
NSFW:  0
SFW:  1667

ruralporn :  1217
NSFW:  0
SFW:  1217

ArchitecturePorn :  1593
NSFW:  0
SFW:  1593

FoodPorn :  1684
NSFW:  0
SFW:  1684

MoviePosterPorn :  1701
NSFW:  14
SFW:  1687

ArtPorn :  1696
NSFW:  140
SFW:  1556

RoomPorn :  1702
NSFW:  0
SFW:  1702

creepy :  1594
NSFW:  120
SFW:  1474

gonewild :  1225
NSFW:  1225
SFW:  0

PrettyGirls :  1648
NSFW:  0
SFW:  1648

ladybonersgw :  1147
NSFW:  1147
SFW:  0

LadyBoners :  1505
NSFW:  52
SFW:  1453

cats :  1683
NSFW:  1
SFW:  1682

dogpictures :  1715
NSFW:  1
SFW:  1714
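
The same breakdown can be held in a single 2-D array rather than a nested dictionary, which makes per-subreddit NSFW shares easy to compute. This is a sketch under the assumption that both label arrays hold small non-negative integers. The toy arrays and the name `table` are hypothetical.

```python
import numpy as np

# Toy stand-ins for `subs` and `nsfw` (hypothetical values)
toy_subs = np.array([0, 0, 1, 2, 2, 2])
toy_nsfw = np.array([0, 1, 0, 1, 1, 0])
num_classes = 3

# Rows index the subreddit, columns index the NSFW flag (0 = SFW, 1 = NSFW)
table = np.zeros((num_classes, 2), dtype=int)
# np.add.at accumulates correctly even when an index pair repeats
np.add.at(table, (toy_subs, toy_nsfw), 1)

# NSFW share per subreddit; np.maximum guards against empty classes
row_totals = table.sum(axis=1)
nsfw_share = table[:, 1] / np.maximum(row_totals, 1)
print(table)
print(nsfw_share)
```

On the real labels, rows such as gonewild and ladybonersgw would be entirely NSFW, which suggests the two labels are far from independent.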

