# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [158]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np
# Create a new dataframe from your CSV
womens_clothing = pd.read_csv(r"./Womens Clothing E-Commerce Reviews.csv")

In [159]:
# Print out any information you need to understand your dataframe
print(womens_clothing)

       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  \
0                                                    NaN   
1                                                    NaN   
2                                Some major design flaws   
3                                       My favorite buy!   
4                                       Flattering shirt   
...                                                  ...   
23481                     Great dress for many occasions   
23482                         Wish it was made of c
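
Printing the whole dataframe wraps wide columns across several blocks, so a compact summary can be easier to scan. The sketch below is one possible approach, not the exercise's required answer: it uses a tiny in-memory stand-in (`csv_text` and `sample` are hypothetical names) rather than the real CSV, and assumes the file's first column is just a saved row index.

```python
import io
import pandas as pd

# Tiny stand-in for the reviews CSV; the real file is not read here.
csv_text = """,Clothing ID,Age,Title,Rating
0,767,33,,4
1,1080,34,,5
2,1077,60,Some major design flaws,3
"""

# index_col=0 treats the first (unnamed) column as the row index,
# so it does not show up as an extra "Unnamed: 0" column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)   # (rows, columns)
sample.info()         # dtypes and non-null counts per column
print(sample.head())  # first few rows
```

The non-null counts from `info()` already hint at which columns contain missing values.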

## Missing Data

Try out different methods to locate and resolve missing data.

In [160]:

# Try to find some missing data!

womens_clothing.isna()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,False,False,False,True,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
23481,False,False,False,False,False,False,False,False,False,False,False
23482,False,False,False,False,False,False,False,False,False,False,False
23483,False,False,False,False,False,False,False,False,False,False,False
23484,False,False,False,False,False,False,False,False,False,False,False
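
A full True/False frame is hard to read at this size, so summing the mask per column may be more useful. The following is a minimal sketch on a small stand-in frame (`sample` is a hypothetical name, and the missing `Review Text` value is invented for illustration); it shows one way to count missing values and two common ways to resolve them.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the missing Review Text value is made up.
sample = pd.DataFrame({
    "Age": [33, 34, 60, 50],
    "Title": [np.nan, np.nan, "Some major design flaws", "My favorite buy!"],
    "Review Text": ["text a", np.nan, "text c", "text d"],
})

# Count missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Select only the rows that have at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]

# Option 1: drop rows where an essential column is missing.
dropped = sample.dropna(subset=["Review Text"])

# Option 2: fill a column's missing values with a placeholder.
filled = sample.fillna({"Title": "No Title"})
```

Both `dropna` and `fillna` return new dataframes, so the result has to be assigned to a variable to keep it.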


Did you find any missing data? What things worked well for you and what did not?

In [161]:
# Respond to the above questions here: I found missing data by calling isna(), which marks null values as True. The Title column contains nulls.

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [163]:
# Keep an eye out for outliers!
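# Flag the positions of rows where Age falls outside the 18-50 range.
outlier_age = np.where((womens_clothing['Age'] < 18) | (womens_clothing['Age'] > 50))
print(outlier_age)
# np.where returns a tuple of position arrays, so take element 0 before dropping:
# womens_clothing.drop(index=womens_clothing.index[outlier_age[0]])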


(array([    2,    10,    12, ..., 23463, 23467, 23485]),)
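
A fixed age range is one way to flag unusual values. Another common option is the interquartile range (IQR) rule, which derives the cutoffs from the data itself. The sketch below is an alternative technique, not the method used in the cell above; the `ages` values are a hypothetical sample, with 99 added to act as an obvious outlier.

```python
import pandas as pd

# Hypothetical ages; 99 is included to act as an outlier.
ages = pd.Series([33, 34, 60, 50, 47, 28, 31, 52, 99], name="Age")

# Quartiles and interquartile range.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)

outliers = ages[is_outlier]
print(outliers)
```

A boolean mask like `is_outlier` can also be used directly to filter a dataframe, for example `df[~is_outlier]`, without going through `np.where`.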


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [164]:
# Make your notes here:

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [165]:
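# Look out for unnecessary data!
# drop() returns a new dataframe; womens_clothing itself is unchanged unless the result is assigned.
womens_clothing.drop(columns=['Recommended IND'])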

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,2,General,Dresses,Dresses
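
The `Unnamed: 0` column in the output above appears to repeat the row index, which would make it a candidate for removal as well. Here is a minimal sketch of how that could be checked before dropping, using a small stand-in frame (`sample` is a hypothetical name) built from the first three rows shown above.

```python
import pandas as pd

# Stand-in frame with an index-like "Unnamed: 0" column.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [767, 1080, 1077],
    "Age": [33, 34, 60],
})

# Check whether the column just repeats the row index.
repeats_index = (sample["Unnamed: 0"].to_numpy() == sample.index.to_numpy()).all()

# Drop it only if it carries no extra information.
if repeats_index:
    sample = sample.drop(columns=["Unnamed: 0"])

# Fully duplicated rows are another kind of unnecessary data.
duplicate_rows = sample.duplicated().sum()
print(sample.columns.tolist(), duplicate_rows)
```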


Did you find any unnecessary data in your dataset? How did you handle it?

In [166]:
# Make your notes here. Yes, I found one unnecessary column, "Recommended IND", and dropped it.

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [167]:
# Look out for inconsistent data!
# fillna() returns a new Series; assign it back to the column to keep the change.
womens_clothing['Title'].fillna('No Title')

0                                                 No Title
1                                                 No Title
2                                  Some major design flaws
3                                         My favorite buy!
4                                         Flattering shirt
                               ...                        
23481                       Great dress for many occasions
23482                           Wish it was made of cotton
23483                                Cute, but see through
23484    Very cute dress, perfect for summer parties an...
23485                      Please make more like this one!
Name: Title, Length: 23486, dtype: object
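
Inconsistent formatting often shows up in categorical text columns. The sample output earlier in this notebook includes the division name "Initmates", which looks like a misspelling. The sketch below shows one way such values could be found and normalized; `division` is a hypothetical stand-in series, the padded lowercase entry is invented for illustration, and treating "Intimates" as the intended spelling is an assumption.

```python
import pandas as pd

# Stand-in values; " general " is made up to show a formatting issue.
division = pd.Series(
    ["Initmates", "General", "General Petite", " general ", "Intimates"],
    name="Division Name",
)

# value_counts() makes near-duplicate categories easy to spot.
print(division.value_counts())

# Trim whitespace, normalize capitalization, then fix the known misspelling.
cleaned = (
    division.str.strip()
    .str.title()
    .replace({"Initmates": "Intimates"})  # assumed intended spelling
)
print(cleaned.value_counts())
```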

Did you find any inconsistent data? What did you do to clean it?

In [169]:
# Make your notes here! Replaced all empty titles with "No Title"