# Clothing Fit / JSON Processing
## (Self-Guided Project)

## by Justin Sierchio

In this project, we will use clothing fit data to practice JSON processing.

The data is stored in a .json file (one JSON record per line) and is available from Kaggle at: https://www.kaggle.com/soujanyag/modcloth-data-preprocessing-for-beginner. More information about the dataset can be found at the same link.

Note: this is a self-guided project following the tutorial provided by Soujanya G at Kaggle.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Load Dataset for Study (one JSON record per line, hence lines=True)
mc_data = pd.read_json('modcloth_final_data.json', lines=True)

print('Dataset uploaded!')

Dataset uploaded!
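
The `lines=True` argument matters here: it tells pandas to treat each line of the file as a separate JSON record (the JSON Lines layout) rather than expecting a single JSON document. As a minimal sketch, assuming a made-up two-record sample in place of the real file (the sample values are illustrative and not taken from the dataset):

```python
from io import StringIO

import pandas as pd

# Hypothetical two-record sample in JSON Lines layout: one JSON object per line.
sample_jsonl = (
    '{"item_id": 1, "fit": "small", "quality": 5}\n'
    '{"item_id": 2, "fit": "fit", "height": "5ft 2in"}\n'
)

# lines=True parses each line as its own record; keys absent from a record become NaN.
sample_df = pd.read_json(StringIO(sample_jsonl), lines=True)

print(sample_df.shape)
print(sample_df.isnull().sum())
```

Records do not need to share the same keys, which is one reason a dataset loaded this way can end up with many missing values.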


Let's display the first 5 rows for this dataset.

In [3]:
# Display first 5 rows of the ModCloth dataset
mc_data.head()

| | item_id | waist | size | quality | cup size | hips | bra size | category | bust | height | user_name | length | fit | user_id | shoe size | shoe width | review_summary | review_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123373 | 29.0 | 7 | 5.0 | d | 38.0 | 34.0 | new | 36.0 | 5ft 6in | Emily | just right | small | 991571 | | | | |
| 1 | 123373 | 31.0 | 13 | 3.0 | b | 30.0 | 36.0 | new | | 5ft 2in | sydneybraden2001 | just right | small | 587883 | | | | |
| 2 | 123373 | 30.0 | 7 | 2.0 | b | | 32.0 | new | | 5ft 7in | Ugggh | slightly long | small | 395665 | 9.0 | | | |
| 3 | 123373 | | 21 | 5.0 | dd/e | | | new | | | alexmeyer626 | just right | fit | 875643 | | | | |
| 4 | 123373 | | 18 | 5.0 | b | | 36.0 | new | | 5ft 2in | dberrones1 | slightly long | small | 944840 | | | | |


## Exploratory Data Analysis

In order to avoid potential coding errors, a good practice is to make sure all the names of the columns use underscores rather than spaces. We can accomplish this task by modifying the column names accordingly.
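
One way to do this without typing every name is to let pandas rewrite the labels. The following is a minimal sketch on a small stand-in frame (the frame `demo` and its values are made up for illustration). It replaces spaces with underscores and, as a separate step, renames `size` to `mc_size`, on the assumption that the rename is meant to avoid a clash with the built-in `DataFrame.size` attribute.

```python
import pandas as pd

# Stand-in frame with a few of the original column labels (values are made up).
demo = pd.DataFrame({'cup size': ['b'], 'bra size': [34.0], 'size': [7], 'user_id': [1]})

# Replace spaces with underscores in every column label.
demo.columns = demo.columns.str.replace(' ', '_')

# Rename 'size' explicitly, since demo.size already means "number of elements".
demo = demo.rename(columns={'size': 'mc_size'})

print(list(demo.columns))
```

The next cell takes the more explicit route and assigns the full list of names by hand.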

In [4]:
# Modify Column Names
mc_data.columns = ['item_id', 'waist', 'mc_size', 'quality', 'cup_size', 'hips', 'bra_size', 'category', 'bust', 'height', 
                   'user_name', 'length', 'fit', 'user_id', 'shoe_size', 'shoe_width', 'review_summary', 'review_text']

Now let's check the overall information for each column.

In [5]:
mc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82790 entries, 0 to 82789
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   item_id         82790 non-null  int64  
 1   waist           2882 non-null   float64
 2   mc_size         82790 non-null  int64  
 3   quality         82722 non-null  float64
 4   cup_size        76535 non-null  object 
 5   hips            56064 non-null  float64
 6   bra_size        76772 non-null  float64
 7   category        82790 non-null  object 
 8   bust            11854 non-null  object 
 9   height          81683 non-null  object 
 10  user_name       82790 non-null  object 
 11  length          82755 non-null  object 
 12  fit             82790 non-null  object 
 13  user_id         82790 non-null  int64  
 14  shoe_size       27915 non-null  float64
 15  shoe_width      18607 non-null  object 
 16  review_summary  76065 non-null  object 
 17  review_text     76065 non-null  object 

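We notice that several columns have far fewer non-null entries than the 82,790 rows in the dataset. To quantify this, let's count the missing values per column with the `isnull` method and express each count as a percentage of the total number of rows.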

In [6]:
# Count missing values per column and express them as a percentage of all rows
missing_data_sum = mc_data.isnull().sum()
missing_data = pd.DataFrame({
    'total_missing_values': missing_data_sum,
    'percentage_of_missing_values': (missing_data_sum / mc_data.shape[0]) * 100,
})
missing_data

| column | total_missing_values | percentage_of_missing_values |
|---|---|---|
| item_id | 0 | 0.0 |
| waist | 79908 | 96.518903 |
| mc_size | 0 | 0.0 |
| quality | 68 | 0.082136 |
| cup_size | 6255 | 7.55526 |
| hips | 26726 | 32.281677 |
| bra_size | 6018 | 7.268994 |
| category | 0 | 0.0 |
| bust | 70936 | 85.681846 |
| height | 1107 | 1.337118 |


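According to this output and the non-null counts reported by `info()`, only 6 of the 18 columns have complete data. Four columns are missing more than 60% of their values: waist (about 96.5%), bust (about 85.7%), shoe_width (about 77.5%), and shoe_size (about 66.3%).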
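
Tallies like these can be computed directly rather than read off the table. The following is a minimal sketch, assuming a helper named `summarize_missing` (a name introduced here for illustration) and a 60% cutoff. It is shown on a small stand-in frame with made-up values, but the same call should apply to `mc_data`.

```python
import numpy as np
import pandas as pd


def summarize_missing(df, threshold=60.0):
    """Return (complete columns, columns missing more than `threshold` percent)."""
    # isnull().mean() is the fraction of nulls per column; scale it to a percentage.
    pct_missing = df.isnull().mean() * 100
    complete = pct_missing[pct_missing == 0].index.tolist()
    heavy = pct_missing[pct_missing > threshold].index.tolist()
    return complete, heavy


# Stand-in frame (made-up values): 'a' is complete, 'b' is 75% missing, 'c' is 25% missing.
demo = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

complete_cols, heavy_cols = summarize_missing(demo)
print(complete_cols)
print(heavy_cols)
```

Using `isnull().mean()` is equivalent to dividing `isnull().sum()` by the number of rows, so it yields the same percentages as the cell above.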