# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

**Dataset Information:**
- **Dataset Name:** Women's Clothing E-Commerce Reviews
- **File:** `Womens Clothing E-Commerce Reviews.csv`
- **Source:** This dataset contains reviews written by customers and includes features like ratings, review text, product categories, and customer information.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [88]:
# Import pandas and any other libraries you need here.

import pandas as pd
import numpy as np
# Create a new dataframe from your CSV

womens_ecommerce = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')

In [89]:
# Print out any information you need to understand your dataframe

print('.info')
womens_ecommerce.info()

print('.describe')
print(womens_ecommerce.describe())

print('.columns')
print(womens_ecommerce.columns)  # a bare expression mid-cell is not displayed, so print it

print('.head')
womens_ecommerce.head(10)

.info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB
.describe
         Unnamed: 0   Clothing ID           Age        Rating  \
count  23486.000000  23486.000000  23486.000000  23486.000000   
mean  

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
6,6,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits
7,7,858,39,"Shimmer, surprisingly goes with lots","I ordered this in carbon for store pick up, an...",4,1,4,General Petite,Tops,Knits
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


## Missing Data

Try out different methods to locate and resolve missing data.

In [90]:
# Try to find some missing data!

womens_ecommerce.isnull()
womens_ecommerce.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

In [91]:
# Fill missing titles with a placeholder; all other columns are left untouched.
review_title = {'Title': 'Untitled Review'}
womens_ecommerce = womens_ecommerce.fillna(value=review_title)
womens_ecommerce.isnull().sum()

Unnamed: 0                   0
Clothing ID                  0
Age                          0
Title                        0
Review Text                845
Rating                       0
Recommended IND              0
Positive Feedback Count      0
Division Name               14
Department Name             14
Class Name                  14
dtype: int64

In [92]:
womens_ecommerce.dropna()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,Untitled Review,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,Untitled Review,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses


In [93]:
womens_ecommerce = womens_ecommerce.dropna()

In [94]:
womens_ecommerce.isna().sum()

Unnamed: 0                 0
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

In [95]:
womens_ecommerce.isnull().sum()

Unnamed: 0                 0
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

Did you find any missing data? What things worked well for you and what did not?

In [96]:
# Respond to the above questions here:

# The fields with missing data were Title, Review Text, Division Name, Department Name, and Class Name. Running .isnull() on its own did not present a lot of useful information: it returns a table full of booleans. While that accurately shows where each value is or isn't missing, it doesn't give us much to work with until it is summarized with .sum().
# I decided to fill every missing title with "Untitled Review", since there were so many of them (3,810) and the actual text of those reviews might still be helpful to consumers. Once all of the missing titles had been filled in, I removed any remaining rows that had an NA value.
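
The approach described in these notes can be sketched on a small toy dataframe. The values below are made up for illustration and only the column names come from the dataset, so treat this as a minimal sketch rather than the notebook's exact steps.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reviews dataframe (values are made up for illustration).
reviews = pd.DataFrame({
    "Title": ["Great fit", np.nan, "Too short", np.nan],
    "Review Text": ["Love it", "Nice fabric", np.nan, "Runs small"],
    "Division Name": ["General", "General", "General Petite", np.nan],
})

# Count missing values per column instead of scanning a table of booleans.
missing_counts = reviews.isnull().sum()

# Fill only the Title column with a placeholder; other columns are untouched.
filled = reviews.fillna(value={"Title": "Untitled Review"})

# Drop rows that are still missing a value in any remaining column.
cleaned = filled.dropna()

print(missing_counts)
print(cleaned)
```

Filling before dropping matters here: calling `dropna()` first would also discard every row whose only gap is the title.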

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [97]:
# Keep an eye out for outliers!
womens_ecommerce.describe()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,22628.0,22628.0,22628.0,22628.0,22628.0,22628.0
mean,11737.272097,919.695908,43.28288,4.183092,0.818764,2.631784
std,6781.574232,201.683804,12.328176,1.115911,0.385222,5.78752
min,0.0,1.0,18.0,1.0,0.0,0.0
25%,5868.75,861.0,34.0,4.0,1.0,0.0
50%,11727.5,936.0,41.0,5.0,1.0,1.0
75%,17617.25,1078.0,52.0,5.0,1.0,3.0
max,23485.0,1205.0,99.0,5.0,1.0,122.0


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [98]:
# Make your notes here:
womens_ecommerce.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22628 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               22628 non-null  int64 
 1   Clothing ID              22628 non-null  int64 
 2   Age                      22628 non-null  int64 
 3   Title                    22628 non-null  object
 4   Review Text              22628 non-null  object
 5   Rating                   22628 non-null  int64 
 6   Recommended IND          22628 non-null  int64 
 7   Positive Feedback Count  22628 non-null  int64 
 8   Division Name            22628 non-null  object
 9   Department Name          22628 non-null  object
 10  Class Name               22628 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.1+ MB
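
In the `.describe()` output above, `Positive Feedback Count` has a maximum of 122 against a 75th percentile of 3, which suggests a long right tail. One common way to flag such values is the interquartile range (IQR) rule. The sketch below uses made-up counts, and the 1.5 multiplier is a convention assumed here, not something this exercise prescribes.

```python
import pandas as pd

# Toy feedback counts (made up); 122 mirrors the maximum seen in describe().
counts = pd.Series([0, 0, 1, 1, 2, 3, 3, 4, 122], name="Positive Feedback Count")

# Quartiles and the interquartile range.
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = counts[(counts < lower) | (counts > upper)]
print(outliers)
```

A flagged value is not necessarily an error. A review with many helpful votes may be perfectly valid, so whether to keep, cap, or drop it depends on the analysis.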


## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [99]:
# Look out for unnecessary data!
womens_ecommerce.columns

Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')

In [100]:
womens_ecommerce["Rating"].value_counts()
womens_ecommerce["Division Name"].value_counts()
womens_ecommerce["Department Name"].value_counts()
womens_ecommerce["Class Name"].value_counts()

Class Name
Dresses           6145
Knits             4626
Blouses           2983
Sweaters          1380
Pants             1350
Jeans             1104
Fine gauge        1059
Skirts             903
Jackets            683
Lounge             669
Swim               332
Outerwear          319
Shorts             304
Sleep              214
Legwear            158
Intimates          147
Layering           132
Trend              118
Casual bottoms       1
Chemises             1
Name: count, dtype: int64

Did you find any unnecessary data in your dataset? How did you handle it?

In [101]:
# Make your notes here.
# I was unable to identify unnecessary data in the dataset. I ran several analyses across columns and used .describe() to get a bigger picture of the data. Admittedly, there is room for me to grow in my understanding of some of these mathematical concepts, like standard deviation, so it's possible there are errors that I am not able to identify with my current understanding.
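
One candidate that may be worth a second look is the `Unnamed: 0` column, which in the `.head()` output appears to repeat the row index. The sketch below shows one way to test that and to count duplicated rows. It runs on a made-up toy frame, and whether the column is truly redundant in the full dataset is an assumption to verify, not a finding.

```python
import pandas as pd

# Toy frame (made up) with a leftover index column like the one in the CSV.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [767, 1080, 1077],
    "Rating": [4, 5, 3],
})

# True when the column simply repeats the row index labels.
repeats_index = (df["Unnamed: 0"].to_numpy() == df.index.to_numpy()).all()

# Drop the column only if it carries no extra information.
if repeats_index:
    df = df.drop(columns=["Unnamed: 0"])

# Count fully duplicated rows as a second check for redundant data.
duplicate_rows = df.duplicated().sum()
print(df.columns.tolist(), duplicate_rows)
```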

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [108]:
# Look out for inconsistent data!

womens_ecommerce.head()
womens_ecommerce.tail()
womens_ecommerce["Division Name"].value_counts()
womens_ecommerce["Review Text"].value_counts()

Review Text
Perfect fit and i've gotten so many compliments. i buy all my suits from here now!                                                                                                                                                                                                                                                                                                                                                                                                                                        3
I purchased this and another eva franco dress during retailer's recent 20% off sale. i was looking for dresses that were work appropriate, but that would also transition well to happy hour or date night. they both seemed to be just what i was looking for. i ordered a 4 regular and a 6 regular, as i am usually in between sizes. the 4 was definitely too small. the 6 fit, technically, but was very ill fitting. not only is the dress itself short, but it is very short-waisted. i a

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
# I did not find any inconsistent data! I don't know if I missed something major, or if remedying the null values in part 1 resolved a lot of the subsequent issues. Obviously the review and title columns are free text, but looking at the department and division names, all of the language and formatting seemed consistent.
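
A quick way to double-check categorical columns is to normalize whitespace and capitalization, then list the distinct values. The sketch below uses made-up values, apart from `Initmates`, which appears as a division name in the `.head()` output and may be worth comparing against `Intimates`. Whether it is a misspelling is an assumption the notebook does not confirm.

```python
import pandas as pd

# Toy category values (made up) with stray spaces and mixed capitalization.
division = pd.Series(
    ["General", "general ", " General Petite", "Initmates", "GENERAL"]
)

# Trim whitespace and standardize capitalization before counting.
normalized = division.str.strip().str.title()

# Sorted unique values make near-duplicates and misspellings easier to spot.
print(sorted(normalized.unique()))
print(normalized.value_counts())
```

Listing unique values scales better than reading `value_counts()` by eye when a column has many categories, because near-identical spellings sort next to each other.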