# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [2]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np

# Create a new dataframe from your CSV.
# read_csv already returns a DataFrame, so no extra pd.DataFrame() call is needed.
eCom_DF = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')

In [3]:
# Print out any information you need to understand your dataframe
print(eCom_DF)

       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  \
0                                                    NaN   
1                                                    NaN   
2                                Some major design flaws   
3                                       My favorite buy!   
4                                       Flattering shirt   
...                                                  ...   
23481                     Great dress for many occasions   
23482                         Wish it was made of c
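A full `print` of a wide dataframe wraps and truncates, as the output above shows. A more compact first look might use `shape`, `head()`, `dtypes`, and `info()`. The small dataframe below is a made-up stand-in for the CSV (its values are copied from the first rows printed above) so that the sketch runs without the file; the name `sample_df` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: three columns, values taken from the printed rows above.
sample_df = pd.DataFrame({
    "Clothing ID": [767, 1080, 1077, 1049, 847],
    "Age": [33, 34, 60, 50, 47],
    "Title": [None, None, "Some major design flaws", "My favorite buy!", "Flattering shirt"],
})

n_rows, n_cols = sample_df.shape   # (number of rows, number of columns)
print(n_rows, n_cols)
print(sample_df.head(3))           # first three rows only
print(sample_df.dtypes)            # one dtype per column
sample_df.info()                   # non-null counts and dtypes in one summary
```

On the real data, the same calls on `eCom_DF` would give the row count, column types, and non-null counts without printing every row.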

## Missing Data

Try out different methods to locate and resolve missing data.

In [4]:
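# Try to find some missing data!

# isna() on its own returns a True/False table the same size as the dataframe:
# eCom_DF.isna()

# Count the missing values in each column.
eCom_DF.isnull().sum()

# Drop every row that has at least one missing value.
remove_missing = eCom_DF.dropna()

# Confirm that no missing values remain (only this last line is displayed).
remove_missing.isnull().sum()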

Unnamed: 0                 0
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

Did you find any missing data? What things worked well for you and what did not?

In [5]:
# Respond to the above questions here: 
# isna() on its own did not work well: it displays only the first and last five rows of True/False values. isnull().sum() worked better because it reports how many values are missing in each column.
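Dropping every row that has a gap is only one way to resolve missing data. The sketch below shows a few alternatives on a tiny invented dataframe (`reviews` and its values are hypothetical, not the real reviews data): counting gaps per column, listing the affected rows, dropping rows only when a specific column is missing, and filling a column with a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with a few gaps; not the real reviews data.
reviews = pd.DataFrame({
    "Title": [np.nan, "My favorite buy!", "Flattering shirt", np.nan],
    "Review Text": ["Comfortable", "Love it", np.nan, "Runs small"],
    "Rating": [4, 5, 5, 3],
})

# Number of missing values in each column.
missing_per_column = reviews.isna().sum()

# Rows that have at least one missing value.
rows_with_gaps = reviews[reviews.isna().any(axis=1)]

# Option 1: drop only the rows that are missing the review text itself.
dropped = reviews.dropna(subset=["Review Text"])

# Option 2: keep every row and fill missing titles with a placeholder.
filled = reviews.fillna({"Title": "No title"})

print(missing_per_column)
print(len(rows_with_gaps), len(dropped), len(filled))
```

Which option fits depends on the analysis. In the cell above, `dropna()` with no arguments reduced the data from 23,486 rows to 19,662, which is a sizeable loss if only the `Title` is missing.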

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [16]:
# Keep an eye out for outliers!
remove_missing.describe()



| | Unnamed: 0 | Clothing ID | Age | Rating | Recommended IND | Positive Feedback Count |
|---|---|---|---|---|---|---|
| count | 19662.0 | 19662.0 | 19662.0 | 19662.0 | 19662.0 | 19662.0 |
| mean | 11755.260655 | 921.297274 | 43.260808 | 4.183145 | 0.818177 | 2.652477 |
| std | 6772.063092 | 200.227528 | 12.258122 | 1.112224 | 0.385708 | 5.834285 |
| min | 2.0 | 1.0 | 18.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5888.25 | 861.0 | 34.0 | 4.0 | 1.0 | 0.0 |
| 50% | 11749.5 | 936.0 | 41.0 | 5.0 | 1.0 | 1.0 |
| 75% | 17624.75 | 1078.0 | 52.0 | 5.0 | 1.0 | 3.0 |
| max | 23485.0 | 1205.0 | 99.0 | 5.0 | 1.0 | 122.0 |


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [7]:
# Make your notes here:
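# describe() helped: every value is non-negative and the ranges look plausible, so nothing stands out as an obvious error. Positive Feedback Count (max 122 against a 75th percentile of 3) and Age (max 99) are the values furthest from the rest.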
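One common way to turn the `describe()` quartiles into an explicit outlier check is the 1.5 × IQR rule. This is a general convention, not something the exercise prescribes, and the series below is invented for illustration (`counts` is a hypothetical name; only the extreme value 122 echoes the maximum in the table above).

```python
import pandas as pd

# Hypothetical feedback counts: mostly small values plus one extreme value.
counts = pd.Series([0, 0, 1, 1, 2, 3, 3, 4, 122], name="Positive Feedback Count")

q1 = counts.quantile(0.25)   # first quartile
q3 = counts.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

# Values more than 1.5 * IQR outside the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = counts[(counts < lower) | (counts > upper)]

print(outliers)
```

A flagged value is not automatically an error. A review with many positive votes may be genuine, so whether to keep, cap, or remove it is a judgment call.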

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [19]:
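# Look out for unnecessary data!

# List the column names to see what is available.
remove_missing.columns

# drop() returns a new dataframe; remove_missing itself is unchanged unless the result is assigned.
remove_missing.drop(columns=['Division Name', 'Department Name', 'Class Name'])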


| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count |
|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 |
| 5 | 5 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 2 | 0 | 4 |
| 6 | 6 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 5 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23481 | 23481 | 1104 | 34 | Great dress for many occasions | I was very happy to snag this dress at such a ... | 5 | 1 | 0 |
| 23482 | 23482 | 862 | 48 | Wish it was made of cotton | It reminds me of maternity clothes. soft, stre... | 3 | 1 | 0 |
| 23483 | 23483 | 1104 | 31 | Cute, but see through | This fit well, but the top was very see throug... | 3 | 0 | 1 |
| 23484 | 23484 | 1084 | 28 | Very cute dress, perfect for summer parties an... | I bought this dress for a wedding i have this ... | 3 | 1 | 2 |


Did you find any unnecessary data in your dataset? How did you handle it?

In [9]:
# Make your notes here.

# Clothing ID already identifies the exact product, so the extra product details (Division Name, Department Name, Class Name) were not needed and I dropped those columns.
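Two other kinds of unnecessary data are worth checking for: a column that only repeats the row index (in the printed output, `Unnamed: 0` matches the index in every visible row) and rows that duplicate each other exactly. The sketch below uses an invented dataframe (`df` and its values are hypothetical) to show one way to test for both.

```python
import pandas as pd

# Hypothetical frame mimicking the CSV layout: "Unnamed: 0" repeats the row index.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Clothing ID": [767, 1080, 1080, 1077],
    "Rating": [4, 5, 5, 3],
})

# True when the column carries the same values as the index.
is_index_copy = (df["Unnamed: 0"] == df.index).all()

# drop() returns a new dataframe, so assign the result to keep the change.
trimmed = df.drop(columns=["Unnamed: 0"])

# Count rows that repeat an earlier row exactly, then remove them.
duplicate_rows = trimmed.duplicated().sum()
deduped = trimmed.drop_duplicates()

print(is_index_copy, duplicate_rows, len(deduped))
```

An alternative for the index copy is to read the file with `pd.read_csv(..., index_col=0)`, which uses the first CSV column as the index instead of loading it as a separate column.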

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [24]:
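# Look out for inconsistent data!

# Attribute access does not work for a column name that contains a space, so use brackets:
# remove_missing.Recommended_ID.astype('bool')

# astype() returns a new Series; remove_missing keeps the original int64 column.
change_bool = remove_missing['Recommended IND'].astype('bool')

remove_missing.info()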

<class 'pandas.core.frame.DataFrame'>
Index: 19662 entries, 2 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               19662 non-null  int64 
 1   Clothing ID              19662 non-null  int64 
 2   Age                      19662 non-null  int64 
 3   Title                    19662 non-null  object
 4   Review Text              19662 non-null  object
 5   Rating                   19662 non-null  int64 
 6   Recommended IND          19662 non-null  int64 
 7   Positive Feedback Count  19662 non-null  int64 
 8   Division Name            19662 non-null  object
 9   Department Name          19662 non-null  object
 10  Class Name               19662 non-null  object
dtypes: int64(6), object(5)
memory usage: 1.8+ MB
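The `info()` output above still lists `Recommended IND` as `int64` because the converted Series was stored in a separate variable. A minimal sketch of re-formatting a column in place follows, together with a typical text clean-up. The dataframe is invented: the inconsistently formatted department labels are an assumption made for illustration, and the real column may already be clean.

```python
import pandas as pd

# Hypothetical rows: a 0/1 flag plus inconsistently formatted category labels.
df = pd.DataFrame({
    "Recommended IND": [0, 1, 1, 0],
    "Department Name": ["Tops", " tops", "DRESSES", "Dresses "],
})

# astype() returns a new Series; assigning it back changes the column's dtype.
df["Recommended IND"] = df["Recommended IND"].astype("bool")

# Trim whitespace and unify capitalization so equal labels compare equal.
df["Department Name"] = df["Department Name"].str.strip().str.title()

print(df.dtypes)
print(df["Department Name"].unique())
```

Checking `value_counts()` on a text column before and after such a clean-up is a quick way to see whether labels that should match were being counted separately.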


Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!