# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [124]:
# Import pandas and any other libraries you need here

import pandas as pd

# 2. Load your CSV into a DataFrame

df = pd.read_csv(r"C:/Users/User/Documents/Gitdemo/data-analysis-projects/Womens Clothing E-Commerce Reviews.csv")

In [125]:
# Print out any information you need to understand your dataframe
df.info()
df.head()
df.tail()
df.describe()
df.columns
df.shape
df.dtypes
df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

## Missing Data

Try out different methods to locate and resolve missing data.

In [None]:
# Fill missing free-text columns with an empty string
df['Title'] = df['Title'].fillna('')
df['Review Text'] = df['Review Text'].fillna('')

# Fill missing categorical columns with 'unknown'
df['Division Name'] = df['Division Name'].fillna('unknown')
df['Department Name'] = df['Department Name'].fillna('unknown')
df['Class Name'] = df['Class Name'].fillna('unknown')

In [126]:
# Try to find some missing data!
df.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Try out different methods to locate and resolve missing data.
#
# Using df.isnull().sum() worked well because it clearly shows missing values per column.
#
# Looking at df.head() or df.tail() didn’t help much because they only show a few rows.

In [None]:
# Respond to the above questions here:
# Yes, I found missing data in five columns: Title (3,810 missing), Review Text (845), and Division Name, Department Name, and Class Name (14 each). Recommended IND has no missing values.
# Using df.isnull().sum() worked well because it shows the number of missing values in each column.
# Relying on df.head() or df.tail() did not work well because they show missing values only if those happen to fall in the first or last few rows.
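
As a rough sketch of the approach described above, the per-column counts can be paired with a percentage and a filter for the affected rows. The `sample` frame below is made-up toy data that borrows a few column names from the dataset; it is not the real reviews file.

```python
import pandas as pd

# Toy stand-in for the reviews data (values are invented for illustration)
sample = pd.DataFrame({
    "Title": ["Great dress", None, "Too small", None],
    "Review Text": ["Love it", "Fits well", None, "Nice fabric"],
    "Rating": [5, 4, 2, 4],
})

# Count and percentage of missing values in each column
missing_summary = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(1),
})
print(missing_summary)

# Rows that contain at least one missing value
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing)
```

Seeing the percentage next to the raw count can make it easier to decide whether to fill a column or drop the affected rows.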


## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [None]:
# Keep an eye out for outliers!


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:
# I used df.describe() to check the minimum, maximum, and quartile values. This helped me spot unusually large numbers, especially in Positive Feedback Count.
# This method was effective because it summarizes every numeric column at once and makes unusual values easy to identify without needing any graphs.
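
One way to turn the quartiles from `df.describe()` into an explicit check is the interquartile range (IQR) rule of thumb, which flags values more than 1.5 × IQR outside the quartiles. This is a minimal sketch on made-up numbers, not the notebook's own method; only the column name comes from the dataset.

```python
import pandas as pd

# Toy values for illustration; the real column has many more rows
sample = pd.DataFrame({"Positive Feedback Count": [0, 1, 0, 2, 3, 1, 0, 122]})

col = sample["Positive Feedback Count"]

# Quartiles and interquartile range
q1 = col.quantile(0.25)
q3 = col.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows that fall outside the fences
outliers = sample[(col < lower) | (col > upper)]
print(outliers)
```

A flagged value is not necessarily an error. A review with an unusually high feedback count may be genuine, so it is worth inspecting the rows before removing them.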


## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [None]:
# Look out for unnecessary data!


Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
#
# Yes, some columns, such as Clothing ID or Title, may be unnecessary depending on the analysis.
# To handle this, I checked all the column names using df.columns and used df.drop() to remove columns that were not helpful for my analysis.
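
The `df.info()` output earlier also lists an `Unnamed: 0` column, which often appears when a CSV was saved with its index. The sketch below assumes that is the case here and checks it before dropping the column; the `sample` values are invented, and the duplicate-row check is an extra step rather than part of the original notes.

```python
import pandas as pd

# Toy stand-in for the reviews data (values are invented for illustration)
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Clothing ID": [767, 1080, 1080, 1077],
    "Rating": [4, 5, 5, 3],
})

# Check whether "Unnamed: 0" simply repeats the row index
is_index_copy = (sample["Unnamed: 0"] == sample.index).all()
print(is_index_copy)

# Drop the redundant column; drop() returns a new frame by default
trimmed = sample.drop(columns=["Unnamed: 0"])
print(trimmed.columns.tolist())

# Count fully duplicated rows, then remove them
duplicate_count = trimmed.duplicated().sum()
deduped = trimmed.drop_duplicates()
print(duplicate_count, len(deduped))
```

Whether duplicated rows should actually be removed depends on the analysis, since two identical rows can still be two genuine reviews.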

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [None]:
# Look out for inconsistent data!

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
# Yes, I found inconsistent formatting in text columns, such as Division Name, Department Name, and Class Name. Some values used different capitalization or had extra spaces.
# To clean this, I used .str.lower() to make all text lowercase and .str.strip() to remove extra spaces. I also checked that numerical columns had the correct data types.
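
The cleanup described above might look like the following sketch. The messy values in `sample` are made up to show the effect and may not match what the real dataset contains.

```python
import pandas as pd

# Toy column with inconsistent capitalization and stray spaces (invented values)
sample = pd.DataFrame({
    "Department Name": ["Tops", " tops", "DRESSES", "Dresses ", "Bottoms"],
})

# Before cleaning: five distinct spellings
print(sample["Department Name"].unique())

# Strip surrounding whitespace, then lowercase everything
cleaned = sample["Department Name"].str.strip().str.lower()

# After cleaning: the variants collapse into three categories
print(cleaned.value_counts())
```

Comparing `unique()` or `value_counts()` before and after is a quick way to confirm that the variants actually merged.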
