# Exploratory Data Analysis (EDA)

## Load Datasets

In [2]:
import pandas as pd
df_electronics = pd.read_csv('electronics.csv')
df_modcloth = pd.read_csv('modcloth.csv')

In [11]:
print("list of columns in electronics.csv")
print(df_electronics.columns)
print("number of rows in electronics.csv",df_electronics.shape[0])
print("top 5 rows")
df_electronics.head()

list of columns in electronics.csv
Index(['item_id', 'user_id', 'rating', 'timestamp', 'model_attr', 'category',
       'brand', 'year', 'user_attr', 'split'],
      dtype='object')
number of rows in electronics.csv 1292954
top 5 rows


| | item_id | user_id | rating | timestamp | model_attr | category | brand | year | user_attr | split |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 5.0 | 1999-06-13 | Female | Portable Audio & Video | | 1999 | | 0 |
| 1 | 0 | 1 | 5.0 | 1999-06-14 | Female | Portable Audio & Video | | 1999 | | 0 |
| 2 | 0 | 2 | 3.0 | 1999-06-17 | Female | Portable Audio & Video | | 1999 | | 0 |
| 3 | 0 | 3 | 1.0 | 1999-07-01 | Female | Portable Audio & Video | | 1999 | | 0 |
| 4 | 0 | 4 | 2.0 | 1999-07-06 | Female | Portable Audio & Video | | 1999 | | 0 |


In [8]:
print("list of columns in modcloth.csv")
print(df_modcloth.columns)
print("number of rows in modcloth.csv",df_modcloth.shape[0])
print("top 5 rows")
df_modcloth.head()

list of columns in modcloth.csv
Index(['item_id', 'user_id', 'rating', 'timestamp', 'size', 'fit', 'user_attr',
       'model_attr', 'category', 'brand', 'year', 'split'],
      dtype='object')
number of rows in modcloth.csv 99893
top 5 rows


| | item_id | user_id | rating | timestamp | size | fit | user_attr | model_attr | category | brand | year | split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7443 | Alex | 4 | 2010-01-21 08:00:00+00:00 | | | Small | Small | Dresses | | 2012 | 0 |
| 1 | 7443 | carolyn.agan | 3 | 2010-01-27 08:00:00+00:00 | | | | Small | Dresses | | 2012 | 0 |
| 2 | 7443 | Robyn | 4 | 2010-01-29 08:00:00+00:00 | | | Small | Small | Dresses | | 2012 | 0 |
| 3 | 7443 | De | 4 | 2010-02-13 08:00:00+00:00 | | | | Small | Dresses | | 2012 | 0 |
| 4 | 7443 | tasha | 4 | 2010-02-18 08:00:00+00:00 | | | Small | Small | Dresses | | 2012 | 0 |


### Conclusion

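1. The electronics dataset has roughly 13 times as many rows as the ModCloth dataset (1,292,954 vs. 99,893).
2. The ModCloth dataset, on the other hand, has more features (12 columns vs. 10): it adds `size` and `fit`.
3. Both datasets have a `timestamp` field, which can be used to sort the ratings chronologically and split them accordingly.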
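
To make the third point concrete, here is a minimal sketch of a chronological split driven by the `timestamp` column. The `chronological_split` helper, the 80/20 cutoff, and the toy frame are illustrative assumptions, not part of the original notebook. The sketch also assumes the column parses cleanly with `pd.to_datetime`.

```python
import pandas as pd

def chronological_split(df, time_col="timestamp", train_frac=0.8):
    """Hypothetical helper: sort rows by time, then cut at train_frac."""
    ordered = df.copy()
    # utc=True handles both plain dates and timezone-aware strings
    ordered[time_col] = pd.to_datetime(ordered[time_col], utc=True)
    ordered = ordered.sort_values(time_col, kind="stable").reset_index(drop=True)
    cutoff = int(len(ordered) * train_frac)
    # Earlier rows go to train, later rows go to test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy data (made up for illustration), deliberately out of order
toy = pd.DataFrame({
    "user_id": [3, 0, 4, 1, 2],
    "rating": [1.0, 5.0, 2.0, 5.0, 3.0],
    "timestamp": ["1999-07-01", "1999-06-13", "1999-07-06",
                  "1999-06-14", "1999-06-17"],
})

train, test = chronological_split(toy)
print(train["user_id"].tolist(), test["user_id"].tolist())
```

Splitting on time rather than at random keeps every test rating later than every training rating, which is usually what you want when the model is meant to predict future ratings.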

## Find the null percentage and unique values in each column of each dataset

In [25]:
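def null_percentage(df):
    """Print the percentage of null values in each column of df."""
    total_rows = df.shape[0]
    for c in df.columns:
        null_count = df[c].isna().sum()
        print("Column : ",c," , Null values percentage : ",(null_count/total_rows)*100)

def unique_values_count(df):
    """Print the number of distinct values in each column of df."""
    for c in df.columns:
        # dropna=False counts NaN as a distinct value, same as len(df[c].unique())
        unique_count = df[c].nunique(dropna=False)
        print("Column : ",c," , Unique values : ",unique_count)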

In [26]:
null_percentage(df_electronics)
print("")
unique_values_count(df_electronics)

Column :  item_id  , Null values percentage :  0.0
Column :  user_id  , Null values percentage :  0.0
Column :  rating  , Null values percentage :  0.0
Column :  timestamp  , Null values percentage :  0.0
Column :  model_attr  , Null values percentage :  0.0
Column :  category  , Null values percentage :  0.0
Column :  brand  , Null values percentage :  74.39042688293628
Column :  year  , Null values percentage :  0.0
Column :  user_attr  , Null values percentage :  86.53285422373882
Column :  split  , Null values percentage :  0.0

Column :  item_id  , Unique values :  9560
Column :  user_id  , Unique values :  1157633
Column :  rating  , Unique values :  5
Column :  timestamp  , Unique values :  6354
Column :  model_attr  , Unique values :  3
Column :  category  , Unique values :  10
Column :  brand  , Unique values :  51
Column :  year  , Unique values :  20
Column :  user_attr  , Unique values :  3
Column :  split  , Unique values :  3


In [28]:
null_percentage(df_modcloth)
print("")
unique_values_count(df_modcloth)

Column :  item_id  , Null values percentage :  0.0
Column :  user_id  , Null values percentage :  0.0010010711461263552
Column :  rating  , Null values percentage :  0.0
Column :  timestamp  , Null values percentage :  0.0
Column :  size  , Null values percentage :  21.78330813970949
Column :  fit  , Null values percentage :  18.52582263021433
Column :  user_attr  , Null values percentage :  8.375962279639214
Column :  model_attr  , Null values percentage :  0.0
Column :  category  , Null values percentage :  0.0
Column :  brand  , Null values percentage :  74.05924339042775
Column :  year  , Null values percentage :  0.0
Column :  split  , Null values percentage :  0.0

Column :  item_id  , Unique values :  1020
Column :  user_id  , Unique values :  44784
Column :  rating  , Unique values :  5
Column :  timestamp  , Unique values :  14741
Column :  size  , Unique values :  10
Column :  fit  , Unique values :  6
Column :  user_attr  , Unique values :  3
Column :  model_attr  , Unique v

### Conclusion

1. The electronics data has no null values except in `brand` and `user_attr`. Both fields have a very high null percentage (about 74% and 87% respectively) and are therefore unfit for further analysis.
2. The ModCloth data, on the other hand, has null values even in the `user_id` field (about 0.001% of rows). Since this field is key to our formula, we will drop all rows with a null `user_id`. Features `size`, `fit` and `user_attr` have moderate null percentages (about 22%, 19% and 8%) and can be used after imputation. Feature `brand` has a very high null percentage (about 74%) and is therefore unfit for further analysis.
3. Both datasets have features with 10 or fewer unique values. Such features can be one-hot encoded. Note that the unique counts above include NaN as a value.
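
The cleaning steps named above could be sketched as follows. This is a sketch under stated assumptions, not the notebook's actual pipeline. The `prepare` helper is hypothetical, and mode imputation is one possible choice, since the text does not say which imputation method is intended. The toy frame is made up: the `fit` labels and the `"Large"` value are placeholders.

```python
import pandas as pd

def prepare(df, key="user_id", impute_cols=("size", "fit", "user_attr"),
            max_levels=10):
    """Hypothetical helper: drop null keys, impute, one-hot encode."""
    # 1. Drop rows whose key column is null
    out = df.dropna(subset=[key]).copy()
    # 2. Fill remaining nulls with the most frequent value (assumed strategy)
    for col in impute_cols:
        if col in out.columns:
            mode = out[col].mode()
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    # 3. One-hot encode the imputed columns with few distinct values
    low_card = [c for c in impute_cols
                if c in out.columns and out[c].nunique() <= max_levels]
    return pd.get_dummies(out, columns=low_card)

# Toy data (made up): one null user_id, some null features
toy = pd.DataFrame({
    "user_id": ["Alex", None, "Robyn", "De", "tasha"],
    "rating": [4, 3, 4, 4, 4],
    "fit": ["A", "B", None, "A", "B"],
    "user_attr": ["Small", None, "Small", "Large", None],
})

clean = prepare(toy)
print(clean.columns.tolist())
```

Unlike the unique counts printed above, `nunique()` ignores NaN by default, which is appropriate here because the nulls have already been filled before the cardinality check.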