# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

**Dataset Information:**
- **Dataset Name:** Women's Clothing E-Commerce Reviews
- **File:** `Womens Clothing E-Commerce Reviews.csv`
- **Source:** This dataset contains reviews written by customers and includes features like ratings, review text, product categories, and customer information.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [9]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np
# Create a new dataframe from your CSV
data = pd.read_csv("Womens_Clothing_E-Commerce_Reviews.csv")

In [None]:
# Print out any information you need to understand your dataframe
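One way to get a first look at a dataframe is sketched below. The three rows are made-up stand-ins so the example runs on its own; in the notebook, the same calls would be made on `data`.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook would use `data` from read_csv.
sample = pd.DataFrame({
    "Clothing ID": [767, 1080, 1077],
    "Age": [33, 34, 60],
    "Title": [None, "Great dress", "Some flaws"],
    "Rating": [4, 5, 3],
})

print(sample.shape)    # (rows, columns)
print(sample.head())   # first rows, to eyeball the values
print(sample.dtypes)   # data type of each column
```

`shape` and `dtypes` are attributes rather than methods, so they take no parentheses.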


## Missing Data

Try out different methods to locate and resolve missing data.

In [10]:
# Try to find some missing data!
data.info()        # prints non-null counts and dtypes for every column
data.isna().sum()  # missing values per column (isnull() is an alias of isna())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
Yes. From the output, five columns have missing values:

- Title: 3810 missing
- Review Text: 845 missing
- Division Name: 14 missing
- Department Name: 14 missing
- Class Name: 14 missing

Title has the most missing values, followed by Review Text. The three categorical columns (Division Name, Department Name, and Class Name) are each missing 14 values.

I used data.info() and data.isna().sum() to detect the missing values. Both worked well because they quickly gave exact counts for each column.

Visually scanning the dataset did not work, because 23,486 rows are too many to check manually.
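The section also asks about resolving missing data. A minimal sketch of two common options follows, using made-up rows. Whether to drop or fill depends on the analysis; the choices here (dropping rows without review text, using a "No Title" placeholder) are assumptions for illustration, not steps taken in this notebook.

```python
import pandas as pd

# Hypothetical stand-in rows with the same kinds of gaps as the real data.
sample = pd.DataFrame({
    "Title": ["Love it", None, None],
    "Review Text": ["Fits well.", "Too small.", None],
    "Class Name": ["Dresses", None, "Knits"],
})

# Drop rows with no review text (assumes the analysis needs the text).
cleaned = sample.dropna(subset=["Review Text"]).copy()

# A missing title is less critical, so keep the row and use a placeholder.
cleaned["Title"] = cleaned["Title"].fillna("No Title")

# Remaining missing values per column.
print(cleaned.isna().sum())
```

Columns that were not handled, such as Class Name here, still show their missing counts, which makes it easy to confirm what is left to resolve.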

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [17]:
# Keep an eye out for outliers!
data[['Age', 'Positive Feedback Count', 'Rating']].describe()


                Age  Positive Feedback Count        Rating
count  23486.000000             23486.000000  23486.000000
mean      43.198544                 2.535936      4.196032
std       12.279544                 5.702202      1.110031
min       18.000000                 0.000000      1.000000
25%       34.000000                 0.000000      4.000000
50%       41.000000                 1.000000      5.000000
75%       52.000000                 3.000000      5.000000
max       99.000000               122.000000      5.000000


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:
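I used describe() on the Age, Positive Feedback Count, and Rating columns to look for unusual values. Comparing the max with the quartiles makes extreme values easy to spot.

The Age column has a maximum of 99, well above the 75th percentile of 52, which could indicate outliers.

The Positive Feedback Count column has a maximum of 122, far above the median of 1 and the 75th percentile of 3.

Rating ranges from 1 to 5, so nothing stands out there.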
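Beyond describe(), the interquartile range (IQR) rule is one common way to flag outliers explicitly. The sketch below uses made-up counts; in the notebook the series would be `data["Positive Feedback Count"]`. The 1.5 multiplier is a convention, not something this dataset prescribes.

```python
import pandas as pd

# Hypothetical feedback counts; in the notebook this would be
# data["Positive Feedback Count"].
counts = pd.Series([0, 0, 1, 1, 2, 3, 3, 4, 122])

q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = counts[(counts < lower) | (counts > upper)]
print(outliers)
```

A flagged value is only a candidate: a review with 122 positive votes may be unusual but still valid, so it is worth inspecting before removing anything.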

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [None]:
# Look out for unnecessary data!
I found that the Unnamed: 0 column was unnecessary because it appears to be an index column automatically created when the dataset was exported.

Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
data = data.drop(columns=['Unnamed: 0'])
# The column provides no analytical value, so I removed it with drop().
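As a sketch of where such a column comes from and two ways to handle it, the example below reads a small made-up CSV string instead of the real file. The `csv_text` contents are hypothetical. The last line also counts fully duplicated rows, another kind of unnecessary data mentioned above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file.
csv_text = ",Clothing ID,Rating\n0,767,4\n1,1080,5\n2,1080,5\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())  # the unnamed first column shows up as "Unnamed: 0"

# Option 1: drop the column after loading.
dropped = raw.drop(columns=["Unnamed: 0"])

# Option 2: treat the first column as the index while loading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Count fully duplicated rows, another form of unnecessary data.
print(dropped.duplicated().sum())
```

Whether duplicated rows should be removed is a judgment call: two customers can legitimately give the same product the same rating.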

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [None]:
# Look out for inconsistent data!
Columns that may contain inconsistencies:
- Division Name
- Department Name
- Class Name
- Title
- Review Text

These columns are strings, so we look for:
- leading/trailing spaces
- inconsistent capitalization
- multiple spellings of the same category
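The checks listed above can be sketched as follows. The values are made up with deliberate formatting problems; in the notebook the series would be a column such as `data["Class Name"]`.

```python
import pandas as pd

# Hypothetical category values with deliberate formatting problems.
classes = pd.Series(["Dresses", "dresses ", " Knits", "Knits", "Fine gauge", None])

# Normalize a copy, then compare unique counts before and after.
normalized = classes.str.strip().str.lower()
print(classes.nunique(), normalized.nunique())

# Rows whose raw value differs from its stripped form have stray whitespace.
has_whitespace = classes.notna() & (classes != classes.str.strip())
print(classes[has_whitespace].tolist())
```

If the unique count drops after normalizing, some categories differ only by spacing or capitalization. Genuinely different spellings of the same category would still need a manual look at the unique values.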

Did you find any inconsistent data? What did you do to clean it?

In [20]:
# Only the last expression in a cell is displayed, so the output below is for Class Name.
data["Division Name"].unique()
data["Department Name"].unique()
data["Class Name"].unique()

array(['Intimates', 'Dresses', 'Pants', 'Blouses', 'Knits', 'Outerwear',
       'Lounge', 'Sweaters', 'Skirts', 'Fine gauge', 'Sleep', 'Jackets',
       'Swim', 'Trend', 'Jeans', 'Legwear', 'Shorts', 'Layering',
       'Casual bottoms', nan, 'Chemises'], dtype=object)

In [21]:
# Strip whitespace, convert to title case, and label missing categories as "Unknown".
for col in ["Division Name", "Department Name", "Class Name"]:
    data[col] = data[col].str.strip().str.title().fillna("Unknown")

In [None]:
# Make your notes here!
I checked categorical columns using .unique() and identified formatting inconsistencies.
In the Class Name column, some values used inconsistent capitalization (e.g., “Fine gauge”, “Casual bottoms”), and some entries had missing values (NaN).
I standardized formatting by using .str.strip().str.title() to ensure all class names follow the same title-case format. Missing values were replaced with "Unknown" to maintain consistency.
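A minimal, self-contained sketch of the cleaning described in these notes, including the "Unknown" replacement. The values are hypothetical, modeled on the Class Name output above.

```python
import pandas as pd

# Hypothetical values modeled on the Class Name output above.
classes = pd.Series(["Fine gauge", " casual bottoms", "Dresses", None])

# .str methods pass NaN through untouched, so fill missing values afterwards.
cleaned = classes.str.strip().str.title().fillna("Unknown")
print(cleaned.tolist())
```

Filling after the string methods matters here: the `.str` calls leave NaN as NaN, so `fillna("Unknown")` is what actually replaces the missing entries.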