# Data Exploration 
#### Due Wednesday, December 1st, 2021
#### By Aaliyah Hänni, Vanessa Joy Hsu, Liem Luong, Dwight Sablan

This section of your proposal (and therefore the document you should submit for this) should summarize an exploration of the data.  You will want to explore the data through visualizations and inspection, looking for:

* feature set defined by name (or identifier if sensitive), variable type, range, etc. - may want statistical plots
* missing values 
* outliers
* correlated features
* likely sources of bias
* other data flaws (is the set too small?  are you missing a feature?)
 
You will also want to discuss the specific features that you are extracting for your project (and justify those features), as well as any data cleaning or augmentation processes you are using.


Data Source: *Clothing Pattern Dataset* by Alexander J. Medeiros, Lee Stearns, Leah Findlater, Chuan Chen, and Jon E. Froehlich 
https://github.com/lstearns86/clothing-pattern-dataset

In [1]:
import pandas as pd
import numpy as np

import urllib.request  # the submodule must be imported explicitly to use urllib.request.urlretrieve
from PIL import Image

import requests
from io import BytesIO

import glob, os

In [2]:
clothingDataset = pd.read_csv('googleClothingDataset.csv')

In [3]:
clothingDataset.head(5)

| | Class Name | URL | Original Width | Original Height | Crop X | Crop Y | Crop Width | Crop Height | Scales |
|---|---|---|---|---|---|---|---|---|---|
| 0 | solid | https://www.publicdomainpictures.net/pictures/... | 1920 | 1280 | 0 | 0 | 1920 | 1280 | 0.2474874;0.4949747;0.9899495;1.979899 |
| 1 | solid | https://c1.staticflickr.com/9/8208/8185035876_... | 5456 | 3064 | 0 | 0 | 5456 | 3064 | 0.103389;0.206778;0.4135559;0.8271118;1.654224 |
| 2 | solid | https://cdn.pixabay.com/photo/2017/08/14/22/24... | 3680 | 2760 | 0 | 0 | 3680 | 2760 | 0.1147768;0.2295535;0.459107;0.918214;1.836428 |
| 3 | solid | https://upload.wikimedia.org/wikipedia/commons... | 2816 | 2112 | 0 | 0 | 2816 | 2112 | 0.1499923;0.2999847;0.5999694;1.199939 |
| 4 | solid | https://c1.staticflickr.com/9/8753/17091052376... | 2500 | 1668 | 0 | 0 | 2500 | 1668 | 0.1899184;0.3798367;0.7596735;1.519347 |
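The `Scales` column is stored as text, with several numbers separated by semicolons, so it needs parsing before it can be used as a numeric feature. The sketch below shows one way to do that; `parse_scales` is a hypothetical helper name, and the input string is copied from row 0 of the table above. What the scale values mean is not stated in this notebook, so the code only converts them and does not interpret them.

```python
def parse_scales(scales):
    """Split a semicolon-separated 'Scales' string into a list of floats."""
    return [float(s) for s in scales.split(";") if s.strip()]

# Value copied from row 0 of the table above.
row0_scales = parse_scales("0.2474874;0.4949747;0.9899495;1.979899")
print(len(row0_scales), min(row0_scales), max(row0_scales))

# Ratio between consecutive scales, to see how the values are spaced.
ratios = [b / a for a, b in zip(row0_scales, row0_scales[1:])]
print([round(r, 3) for r in ratios])
```

In the notebook this could be applied to the whole column with `clothingDataset['Scales'].apply(parse_scales)`.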


In [4]:
# describe is a method and must be called; without the parentheses pandas
# only prints the bound method's repr (a dump of the frame, whose index runs
# from 0 to 2748) instead of summary statistics
clothingDataset.describe()

In [5]:
clothingDataset.dtypes

Class Name         object
URL                object
Original Width      int64
Original Height     int64
Crop X              int64
Crop Y              int64
Crop Width          int64
Crop Height         int64
Scales             object
dtype: object
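Several of the checks listed at the top of this section (missing values, class balance, other data flaws) can be run directly on the metadata columns. The following is a minimal sketch on a tiny stand-in frame that reuses the CSV's column names; every value in it is made up for illustration. It also assumes that the crop columns describe a box that should fit inside the original image, which is an interpretation of the column names rather than something this notebook confirms.

```python
import pandas as pd

# Stand-in frame with the same column names as the CSV; all values are made up.
sample = pd.DataFrame({
    "Class Name": ["solid", "solid", "zig zag", None],
    "URL": ["http://example.com/a.jpg", "http://example.com/a.jpg",
            "http://example.com/b.jpg", "http://example.com/c.jpg"],
    "Original Width": [1920, 1920, 800, 640],
    "Original Height": [1280, 1280, 600, 480],
    "Crop X": [0, 0, 100, 0],
    "Crop Y": [0, 0, 0, 0],
    "Crop Width": [1920, 1920, 800, 640],
    "Crop Height": [1280, 1280, 600, 480],
})

# Missing values per column.
missing_per_column = sample.isna().sum()

# URLs that appear more than once (possible duplicate images).
duplicate_urls = sample["URL"].duplicated().sum()

# Class balance, counting missing labels as their own group.
class_counts = sample["Class Name"].value_counts(dropna=False)

# Assumption: a crop box should not extend past the original image.
crop_out_of_bounds = sample[
    (sample["Crop X"] + sample["Crop Width"] > sample["Original Width"])
    | (sample["Crop Y"] + sample["Crop Height"] > sample["Original Height"])
]

print(missing_per_column)
print("Duplicate URLs:", duplicate_urls)
print(class_counts)
print("Rows with out-of-bounds crops:", len(crop_out_of_bounds))
```

Swapping `sample` for `clothingDataset` would run the same checks on the real metadata.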

## Extract Images from URL

In [None]:
images = []   # list of valid images, stored as [pattern, image] pairs
badURLs = []  # list of broken or invalid URLs, stored as [pattern, url] pairs

for counter, row in enumerate(clothingDataset.iloc[:, :2].values.tolist()):
    try:
        # alternative (disabled): download each image to a file instead
        #urlReq = urllib.request.urlretrieve(row[1], row[0] + '_' + str(counter) + '.jpg')
        #images.append(row[0] + '_' + str(counter) + '.jpg')
        # fetch the image and keep it in memory as a PIL object
        r = requests.get(row[1], timeout=10)  # timeout so a dead host cannot stall the loop
        r.raise_for_status()  # treat HTTP errors (404, 403, ...) as bad URLs
        images.append([row[0], Image.open(BytesIO(r.content))])
    except Exception:
        # any network, HTTP, or image-decoding failure marks the URL as bad
        badURLs.append([row[0], row[1]])

In [None]:
print("Valid Image URLs:   ", len(images))
print("Invalid Image URLs: ", len(badURLs))

In [None]:
#format the images into a dataframe
images = pd.DataFrame(images, columns = ['pattern', 'image'])
badURLs = pd.DataFrame(badURLs, columns = ['pattern', 'url'])

In [None]:
print("VALID IMAGE URLS")
print('')
print("     Solid: ", len(images[images.pattern == 'solid']))
print("     Checkered: ", len(images[images.pattern == 'checkered']))
print("     Floral: ", len(images[images.pattern == 'floral']))
print("     Dotted: ", len(images[images.pattern == 'dotted']))
print("     Striped: ", len(images[images.pattern == 'striped']))
print("     Zig Zag: ", len(images[images.pattern == 'zig zag']))
print('')
print("     Total: ", len(images))

In [None]:
print("INVALID IMAGE URLS")
print('')
print("     Solid: ", len(badURLs[badURLs.pattern == 'solid']))
print("     Checkered: ", len(badURLs[badURLs.pattern == 'checkered']))
print("     Floral: ", len(badURLs[badURLs.pattern == 'floral']))
print("     Dotted: ", len(badURLs[badURLs.pattern == 'dotted']))
print("     Striped: ", len(badURLs[badURLs.pattern == 'striped']))
print("     Zig Zag: ", len(badURLs[badURLs.pattern == 'zig zag']))
print('')
print("     Total: ", len(badURLs))
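The per-class counts printed above can also be collected into a single table, which makes it easier to see whether broken URLs affect some classes more than others (a likely source of class imbalance). This is a sketch only: `valid` and `invalid` are hypothetical stand-ins for the `images` and `badURLs` frames, and their contents are made up.

```python
import pandas as pd

# Stand-ins for the `images` and `badURLs` frames; the rows are made up.
valid = pd.DataFrame({"pattern": ["solid", "solid", "floral", "zig zag"]})
invalid = pd.DataFrame({"pattern": ["solid", "floral", "floral"]})

# One row per class, with valid and invalid counts side by side.
summary = pd.DataFrame({
    "valid": valid["pattern"].value_counts(),
    "invalid": invalid["pattern"].value_counts(),
}).fillna(0).astype(int)

# Share of each class's URLs that could still be downloaded.
summary["fraction_valid"] = summary["valid"] / (summary["valid"] + summary["invalid"])
print(summary)
```

Classes missing from one of the two frames show up as 0 rather than being dropped, because the two count series are aligned on the union of their class names.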

In [None]:
#display the first five images as thumbnails
size = 128, 128 #maximum thumbnail size
for i in range(5):
    thumb = images.image[i].copy() #thumbnail() resizes in place, so work on a copy
    thumb.thumbnail(size)
    display(thumb)
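The CSV also carries `Crop X`, `Crop Y`, `Crop Width`, and `Crop Height` columns that the cells above do not use. If those columns describe the top-left corner and size of the region of interest (an assumption based on the column names, not confirmed here), they can be converted into the `(left, upper, right, lower)` box that Pillow's `Image.crop` expects. `crop_box` is a hypothetical helper name.

```python
def crop_box(crop_x, crop_y, crop_width, crop_height):
    """Convert x, y, width, height into a (left, upper, right, lower) box."""
    return (crop_x, crop_y, crop_x + crop_width, crop_y + crop_height)

# Crop values copied from row 0 of the dataset preview.
box = crop_box(0, 0, 1920, 1280)
print(box)
```

With a downloaded image `img`, the crop would then be `img.crop(box)`.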

## Explore FingerCamera Dataset
FingerCamera

* \checkered
* \dotted
* \floral
* \solid
* \striped
* \zig zag
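A first step in exploring this folder layout is to count how many image files each class subfolder holds. The sketch below assumes that images sit directly inside each class folder and use one of the listed file extensions; neither is confirmed by this notebook. `count_images_per_class` is a hypothetical helper, and the demo runs on a throwaway directory rather than on the real FingerCamera data.

```python
import os
import tempfile
from collections import Counter

def count_images_per_class(root, extensions=(".jpg", ".jpeg", ".png")):
    """Count image files directly inside each subfolder of `root`."""
    counts = Counter()
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        counts[class_name] = sum(
            1 for f in os.listdir(class_dir) if f.lower().endswith(extensions)
        )
    return counts

# Demo on a temporary directory with made-up, empty files.
with tempfile.TemporaryDirectory() as root:
    for class_name, n in [("checkered", 2), ("zig zag", 1)]:
        os.makedirs(os.path.join(root, class_name))
        for i in range(n):
            open(os.path.join(root, class_name, f"img_{i}.png"), "w").close()
    demo_counts = count_images_per_class(root)

print(dict(demo_counts))
```

On the real data the call would be `count_images_per_class('FingerCamera')`, with the path adjusted to wherever the folder was extracted.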

## Cited Sources
* https://github.com/lstearns86/clothing-pattern-dataset 
* https://www.kite.com/python/answers/how-to-read-an-image-data-from-a-url-in-python
* https://pillow.readthedocs.io/en/stable/reference/Image.html
* https://datascience.stackexchange.com/questions/58351/how-to-retrieve-images-from-a-url-in-a-pandas-dataframe-and-store-them-as-pil-ob
* https://stackoverflow.com/questions/46107348/how-to-display-image-stored-in-pandas-dataframe