# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [None]:

import pandas as pd
import numpy as np

df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv', index_col=0)
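
The cell above loads the file but does not yet print anything about it. A minimal sketch of a first inspection follows, using a tiny in-memory stand-in for the CSV so it runs without the real file. The three sample rows and the name `sample` are hypothetical, for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real CSV (values are illustrative).
sample_csv = io.StringIO(
    ",Clothing ID,Age,Title,Rating\n"
    "0,767,33,,4\n"
    "1,1080,34,,5\n"
    "2,1095,39,Nice fit,5\n"
)

# index_col=0 uses the unnamed first column as the row index.
sample = pd.read_csv(sample_csv, index_col=0)

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types as inferred by pandas
sample.info()         # non-null counts per column
print(sample.head())  # first few rows
```

The same four calls can be run on `df` once the real file is loaded.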

## Missing Data

Try out different methods to locate and resolve missing data.

In [None]:
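# Try to find some missing data!
# Count missing values per column (print it so both results are displayed).
print(df.isna().sum())
# Show the first few rows that contain at least one missing value.
df[df.isna().any(axis=1)].head()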

Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64


Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
11,1095,39,,This dress is perfection! so pretty and flatte...,5,1,2,General Petite,Dresses,Dresses
30,1060,33,,Beautifully made pants and on trend with the f...,5,1,0,General Petite,Bottoms,Pants
36,1002,29,,This is a comfortable skirt that can span seas...,4,1,5,General,Bottoms,Skirts


Did you find any missing data? What things worked well for you and what did not?

In [None]:
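# Respond to the above questions here:
# Yes. df.isna().sum() counts the missing (NaN/None) values in each column: Title has 3,810,
# Review Text has 845, and Division Name, Department Name, and Class Name have 14 each.
# Empty CSV fields are read in as NaN, but whitespace-only strings would not be counted as missing.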
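
The exercise also asks to resolve missing data, not only locate it. One possible approach is sketched below on a toy frame: fill optional fields with a placeholder and drop rows that lack essential fields. The toy values, the `"No Title"` placeholder, and the choice of which columns count as essential are all assumptions, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the columns that have gaps.
toy = pd.DataFrame({
    "Title": ["Great dress", np.nan, np.nan, "Cute top"],
    "Review Text": ["Fits well.", "Love it.", np.nan, "Runs small."],
    "Division Name": ["General", "General", "General Petite", np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100
print(missing_pct)

cleaned = toy.copy()
# Assumption: a title is optional, so fill gaps with a placeholder.
cleaned["Title"] = cleaned["Title"].fillna("No Title")
# Assumption: rows without review text or a division are not usable, so drop them.
cleaned = cleaned.dropna(subset=["Review Text", "Division Name"])
print(cleaned)
```

Whether to fill or drop depends on the analysis: dropping the 3,810 rows without a title would discard far more data than dropping the 14 rows without a division.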

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [5]:
# Keep an eye out for outliers!
print("--- Age Distribution ---")
print(df['Age'].describe())

--- Age Distribution ---
count    23486.000000
mean        43.198544
std         12.279544
min         18.000000
25%         34.000000
50%         41.000000
75%         52.000000
max         99.000000
Name: Age, dtype: float64


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [6]:
# Make your notes here: For the Age column, describe() quickly shows the range (min 18, max 99), the center (mean 43.2, median 41), and the spread (std 12.3).
# There were no impossible values like 500, but the max of 99 sits far above the 75th percentile (52) and is worth a closer look.
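
A common follow-up to `describe()` is the 1.5 × IQR rule, which turns the quartiles into explicit cutoffs. With the quartiles printed above (25% = 34, 75% = 52), the IQR is 18, so the fences fall at 34 - 27 = 7 and 52 + 27 = 79. The sketch below applies the same rule to a short, hypothetical list of ages; the rule is a convention, not a definitive test for bad data.

```python
import pandas as pd

# Hypothetical ages; the real column has 23,486 values.
ages = pd.Series([18, 29, 33, 34, 39, 41, 52, 60, 99], name="Age")

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(f"fences: {lower} to {upper}")
print(outliers)
```

A flagged value is only a candidate: an age of 99 is unusual but possible, so it may be better to inspect those rows than to delete them.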

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [7]:
# Look out for unnecessary data!
print("--- Unique Values Per Column ---")
print(df.nunique())

--- Unique Values Per Column ---
Clothing ID                 1206
Age                           77
Title                      13993
Review Text                22634
Rating                         5
Recommended IND                2
Positive Feedback Count       82
Division Name                  3
Department Name                6
Class Name                    20
dtype: int64


Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
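# Make your notes here.
# The CSV's unnamed first column just repeats the row number, so it adds nothing to the analysis.
# read_csv(..., index_col=0) above already turns it into the index; this guard drops it
# only if the file was loaded without index_col and it shows up as "Unnamed: 0".
if "Unnamed: 0" in df.columns:
    df = df.drop(columns=["Unnamed: 0"])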
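
Besides index-like columns, two other kinds of unnecessary data are fully duplicated rows and columns that hold a single value. A minimal sketch of both checks follows, on a toy frame. The `Source` column and all values are hypothetical and do not appear in the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Clothing ID": [767, 1080, 1080, 1095],
    "Rating": [4, 5, 5, 5],
    "Source": ["web", "web", "web", "web"],  # hypothetical constant column
})

# Rows that are identical to an earlier row.
dup_count = toy.duplicated().sum()

# Columns with at most one distinct value carry no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

trimmed = toy.drop_duplicates().drop(columns=constant_cols)
print(dup_count, constant_cols)
print(trimmed)
```

In a review dataset, identical rows might be genuine repeat reviews, so it is worth looking at them before dropping.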

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [None]:
# Look out for inconsistent data!
# Example: check category text for inconsistent casing or spacing
print(df['Department Name'].unique())
# Trim surrounding whitespace and standardize capitalization.
df['Department Name'] = df['Department Name'].str.strip().str.title()
df['Division Name'] = df['Division Name'].str.strip().str.title()


Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
# Inconsistency appeared mostly in text columns, as variations in casing and spacing
# (e.g. 'Intimates', 'intimates', ' INTIMATES').
# Applying str.strip() and str.title() made the values in each column uniform.
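
Stripping and title-casing do not change spelling. The sample rows printed in the Missing Data section show a Division Name of `Initmates`, which looks like a misspelling of `Intimates`. Assuming it is a typo and not an intentional label, an explicit mapping can correct it after normalization. The toy values below, including the casing and spacing variants, are hypothetical.

```python
import pandas as pd

# Toy column with casing, spacing, and spelling variants (hypothetical values).
division = pd.Series(
    [" general", "General Petite", "Initmates", "INITMATES "],
    name="Division Name",
)

# Step 1: normalize whitespace and capitalization.
normalized = division.str.strip().str.title()

# Step 2: map known misspellings to the intended label (whole-value match).
fixed = normalized.replace({"Initmates": "Intimates"})

print(fixed.unique())
```

Running `value_counts()` on the column before and after is a quick way to confirm that the variants collapsed into the expected categories.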