In [3]:
## Begin Setup
import pandas as pd
import numpy as np
## End Setup

# Intro to Pandas Data Cleaning 

<center><img src="../images/stock/pexels-pixabay-434163.jpg" width="500"></center>

## Introduction
Why is data cleaning important?

* Real-world data is often messy, with errors, inconsistencies, and missing values.

* Cleaning ensures data quality, leading to more reliable analysis and accurate conclusions.

## Overview of common data cleaning steps:

* __Handling Missing Values__: Sometimes data has empty spots. We need to decide whether to remove them or fill them in.

* __Removing Irrelevant Data__: Not all the information we have will be useful. We want to get rid of things like duplicate entries or columns that don't tell us anything new. 

* __Filtering Data__: Often, we only want to look at a specific part of our data. Filtering lets us select just the rows that meet certain conditions.

* __Handling Outliers__: Outliers are data points that are very different from the rest. They can sometimes be real, but other times they're errors. 

* __Correcting Inconsistencies__: Data can be recorded in different ways even when it means the same thing (like "USA" vs. "United States"). We need to make sure these things are consistent so our analysis is accurate.

### Handling Missing Values

<center><img src="../images/stock/pexels-matilda-wormwood-4099467.jpg" width="500"></center>

Sometimes, our data has empty spots, which Pandas represents as `NaN` (Not a Number). We need to deal with these so they don't mess up our analysis.

#### Identifying Missing Values:

* We can use `df.isna()` to create a table where `True` indicates a missing value and `False` means the value is present.

* Conversely, `df.notna()` gives us a table where `True` means the value is present and `False` indicates a missing value.
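
A full `True`/`False` table is hard to read on a large dataset, so it often helps to summarize it. The sketch below uses a small made-up DataFrame (the `name` and `score` columns are illustrative, not part of the lesson data) to count missing values per column and to pull out the rows that contain at least one gap.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
df = pd.DataFrame({'name': ['Ana', 'Ben', None],
                   'score': [90.0, np.nan, np.nan]})

# True counts as 1, so summing the boolean table counts missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Keep only the rows that have at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```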

#### Handling Missing Data:

There are a few common ways to handle missing values:

* __Removing Missing Data:__

    * We can remove entire rows that have missing values using `df.dropna()`. Be careful, as this can reduce the size of your dataset.
    * We can also remove entire columns with missing values by using `df.dropna(axis=1)`. This is useful if a whole feature has a lot of missing information.


* __Filling Missing Data:__

    * Instead of removing data, we can try to fill in the gaps with estimated values using `df.fillna()`:

        * Fill with a specific value: `df.fillna(0)` would replace all `NaN` values with 0. You can use any value that makes sense for your data.
        * Fill with the mean: `df.fillna(df.mean(numeric_only=True))` replaces missing numerical values with the average of that column. The `numeric_only=True` argument skips text columns, which have no mean.
        * Fill with the median: `df.fillna(df.median(numeric_only=True))` replaces missing numerical values with the middle value, which is less affected by extreme values.

By using these Pandas functions, we can effectively identify and handle missing data in our datasets.
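
To make the difference between mean and median filling concrete, here is a minimal sketch on made-up data (the `city` and `income` columns are assumptions for illustration). One very large income pulls the mean up, while the median stays close to the typical value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric column with a gap and one extreme value
df = pd.DataFrame({'city': ['Austin', 'Boston', None, 'Denver'],
                   'income': [40.0, np.nan, 50.0, 300.0]})

# numeric_only=True restricts the statistic to numeric columns,
# so only 'income' is filled and the missing 'city' is left alone
mean_filled = df.fillna(df.mean(numeric_only=True))
median_filled = df.fillna(df.median(numeric_only=True))

print(mean_filled)
print(median_filled)
```

The text column is untouched in both results, so it would still need its own strategy, such as a placeholder string.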

__Consider the Following__

A dataset of customer information contains missing values in the `Email`, `Phone Number`, and `Address` columns.

We can use `df.isna()` to locate these missing values.

In [4]:
### Begin Example
# Create a sample DataFrame
data = {'Customer_ID': [1, 2, 3, 4, 5],
        'Email': ['john.doe@example.com', 
                  np.nan, 
                  'jane.smith@example.com', 
                  'alice.johnson@example.com', 
                  np.nan
                 ],
        'Phone Number': ['123-456-7890', 
                         '987-654-3210', 
                         np.nan, 
                         '555-123-4567', 
                         np.nan
                        ],
        'Address': ['123 Main St', 
                    '456 Oak Ave', 
                    '789 Pine Ln', 
                    np.nan, 
                    '246 Elm St'
                   ]}

customer_df = pd.DataFrame(data)

customer_df.isna()
### End Example

Unnamed: 0,Customer_ID,Email,Phone Number,Address
0,False,False,False,False
1,False,True,False,False
2,False,False,True,False
3,False,False,False,True
4,False,True,True,False


If our analysis needs both `Email` and `Phone Number`, we might drop every row that is missing either one, using the `subset` argument of `df.dropna()`.

In [5]:
### Begin Example
# Drop rows where 'Email' or 'Phone Number' is missing
cleaned_customer_df = customer_df.dropna(subset=['Email', 'Phone Number'])
cleaned_customer_df
### End Example

Unnamed: 0,Customer_ID,Email,Phone Number,Address
0,1,john.doe@example.com,123-456-7890,123 Main St
3,4,alice.johnson@example.com,555-123-4567,
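
Before dropping anything, it can be worth comparing how much data each `dropna()` variant would remove. The sketch below rebuilds the same customer data in compact form and compares the default behavior, the `thresh` argument (keep rows with at least that many non-missing values), and `axis=1`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

customer_df = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 4, 5],
    'Email': ['john.doe@example.com', np.nan, 'jane.smith@example.com',
              'alice.johnson@example.com', np.nan],
    'Phone Number': ['123-456-7890', '987-654-3210', np.nan,
                     '555-123-4567', np.nan],
    'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln',
                np.nan, '246 Elm St'],
})

# Default: drop a row if ANY column is missing
strict = customer_df.dropna()

# Keep rows that have at least 3 non-missing values out of 4
lenient = customer_df.dropna(thresh=3)

# Drop every column that contains a missing value
complete_columns = customer_df.dropna(axis=1)

print(len(customer_df), len(strict), len(lenient))
print(list(complete_columns.columns))
```

On this data the default keeps only one of five rows, which illustrates why `dropna()` should be used with care.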


If only a few values are missing, we might fill them with a placeholder instead.

For `Address`, we might use `'Unknown'`. For `Email` and `Phone Number`, we might use an empty string (`''`).

In [6]:
### Begin Example
# Fill missing 'Address' values with 'Unknown', and missing 'Email' and 'Phone Number' values with ''
customer_df['Address'] = customer_df['Address'].fillna('Unknown')
customer_df['Email'] = customer_df['Email'].fillna('')
customer_df['Phone Number'] = customer_df['Phone Number'].fillna('')

customer_df
### End Example

Unnamed: 0,Customer_ID,Email,Phone Number,Address
0,1,john.doe@example.com,123-456-7890,123 Main St
1,2,,987-654-3210,456 Oak Ave
2,3,jane.smith@example.com,,789 Pine Ln
3,4,alice.johnson@example.com,555-123-4567,Unknown
4,5,,,246 Elm St


### Removing Irrelevant Data

<center><img src="../images/stock/pexels-cottonbro-4108715.jpg" width="500"></center>

Sometimes, our datasets contain information that doesn't help us with our analysis. Getting rid of this "noise" makes our data cleaner and easier to work with.

#### Identifying and Removing Unnecessary Columns:

Often, datasets include columns that are unique identifiers (like `Customer_ID`) or provide information that isn't relevant to our questions.

For example, if we're analyzing customer purchase behavior, the unique `Customer_ID` might not tell us anything about what they buy.

To remove one or more columns, we use the `df.drop()` function. We need to specify the names of the columns we want to remove and set `axis=1` to indicate we're dropping columns (as opposed to rows).

```python
# Example: Removing the 'Customer_ID' column
df = df.drop('Customer_ID', axis=1)

# Example: Removing multiple columns
columns_to_drop = ['Transaction_Time', 'Unused_Column']
df = df.drop(columns_to_drop, axis=1)
```

 

#### Removing Duplicate Rows

__Identifying Duplicates:__ 

* Duplicate rows are identical entries in our dataset. They don't provide any new information and can skew our analysis. 

* We can find them using `df.duplicated()`. 

* This returns a Series of True or False values, where True indicates a row that is a duplicate of a previous row.

__Removing Duplicates:__

* To remove these duplicate rows, we use the `df.drop_duplicates()` function.
* By default, it keeps the first occurrence of each unique row.

```python
# Example: Removing duplicate rows, keeping the first occurrence
df = df.drop_duplicates()

# Example: Removing duplicates, keeping the last occurrence
df = df.drop_duplicates(keep='last')

# Example: Removing duplicates based on specific columns
df = df.drop_duplicates(subset=['Customer_ID', 'Product_Name'])
```

__Consider the Following:__

A dataset of website traffic contains duplicate entries.

In [29]:
## Begin Example
# Create a sample DataFrame with duplicate rows
traffic_data = {'User_ID': [101, 102, 101, 103, 102, 104],
                'Page_Visited': ['Home', 
                                 'Products', 
                                 'Home', 
                                 'Services', 
                                 'Products', 
                                 'Contact'
                                ],
                'Timestamp': ['2024-07-28 10:00:00', 
                              '2024-07-28 10:15:00', 
                              '2024-07-28 10:00:00', 
                              '2024-07-28 10:30:00', 
                              '2024-07-28 10:15:00', 
                              '2024-07-28 10:45:00'
                             ]}

traffic_df = pd.DataFrame(traffic_data)
traffic_df
## End Example

Unnamed: 0,User_ID,Page_Visited,Timestamp
0,101,Home,2024-07-28 10:00:00
1,102,Products,2024-07-28 10:15:00
2,101,Home,2024-07-28 10:00:00
3,103,Services,2024-07-28 10:30:00
4,102,Products,2024-07-28 10:15:00
5,104,Contact,2024-07-28 10:45:00


Rows 2 and 4 are exact copies of rows 0 and 1, so we can identify and remove them.

In [16]:
## Begin Example
# Identify duplicate rows
print(f"Duplicate Row Series\n\n {traffic_df.duplicated()}")

# Remove duplicate rows
cleaned_traffic_df = traffic_df.drop_duplicates()
print(f"\n\nCleaned DataFrame\n\n {cleaned_traffic_df}")
## End Example

Duplicate Row Series

 0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool


Cleaned DataFrame

    User_ID Page_Visited            Timestamp
0      101         Home  2024-07-28 10:00:00
1      102     Products  2024-07-28 10:15:00
3      103     Services  2024-07-28 10:30:00
5      104      Contact  2024-07-28 10:45:00
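
The `subset` and `keep` options can be tried on the same traffic data. As a sketch, suppose we only want each user's most recent visit (an assumed goal, not one stated in the dataset). We can deduplicate on `User_ID` alone and keep the last occurrence. Passing `keep=False` to `duplicated()` is also useful for inspecting every copy of a repeated row, not just the later ones.

```python
import pandas as pd

traffic_df = pd.DataFrame({
    'User_ID': [101, 102, 101, 103, 102, 104],
    'Page_Visited': ['Home', 'Products', 'Home', 'Services', 'Products', 'Contact'],
    'Timestamp': ['2024-07-28 10:00:00', '2024-07-28 10:15:00',
                  '2024-07-28 10:00:00', '2024-07-28 10:30:00',
                  '2024-07-28 10:15:00', '2024-07-28 10:45:00'],
})

# Mark every row that has an identical twin, including the first occurrence
all_copies = traffic_df.duplicated(keep=False)
print(traffic_df[all_copies])

# One row per user, keeping the last row seen for each User_ID
last_visit = traffic_df.drop_duplicates(subset=['User_ID'], keep='last')
print(last_visit)
```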


### Filtering Data

<center><img src="../images/stock/pexels-cottonbro-4107284.jpg" width="500"></center>

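Often we only need the rows that meet a certain condition.

Boolean indexing does this: a comparison such as `df['column'] == value` produces a Series of `True` and `False` values, and passing that Series inside `df[...]` keeps only the rows marked `True`.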

__Consider the Following:__

A dataset of sales transactions contains information about different products. 

In [18]:
## Begin Example
# Create a sample DataFrame
sales_data = {'Transaction_ID': [1, 2, 3, 4, 5],
                'Product': ['Product A', 'Product B', 'Product A', 'Product C', 'Product A'],
                'Quantity': [10, 5, 20, 15, 30],
                'Price': [25.00, 50.00, 25.00, 75.00, 25.00]}
sales_df = pd.DataFrame(sales_data)

sales_df
## End Example

Unnamed: 0,Transaction_ID,Product,Quantity,Price
0,1,Product A,10,25.0
1,2,Product B,5,50.0
2,3,Product A,20,25.0
3,4,Product C,15,75.0
4,5,Product A,30,25.0


We can filter the data to include only transactions for a specific product.

In [19]:
## Begin Example
# Filter for transactions of 'Product A'
product_a_sales = sales_df[sales_df['Product'] == 'Product A']
product_a_sales
## End Example

Unnamed: 0,Transaction_ID,Product,Quantity,Price
0,1,Product A,10,25.0
2,3,Product A,20,25.0
4,5,Product A,30,25.0
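
Conditions can also be combined. The sketch below reuses the same sales data and shows two common patterns: joining conditions with `&` (and), where each condition needs its own parentheses, and matching against several values with `isin()`. The quantity threshold of 20 is an arbitrary choice for illustration.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Transaction_ID': [1, 2, 3, 4, 5],
    'Product': ['Product A', 'Product B', 'Product A', 'Product C', 'Product A'],
    'Quantity': [10, 5, 20, 15, 30],
    'Price': [25.00, 50.00, 25.00, 75.00, 25.00],
})

# Both conditions must hold: Product A AND at least 20 units
large_a_sales = sales_df[(sales_df['Product'] == 'Product A')
                         & (sales_df['Quantity'] >= 20)]
print(large_a_sales)

# Match any value in a list
b_or_c_sales = sales_df[sales_df['Product'].isin(['Product B', 'Product C'])]
print(b_or_c_sales)
```

Use `|` in place of `&` when either condition is enough, and `~` to negate a condition.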


### Handling Outliers

<center><img src="../images/stock/pexels-shvetsa-5217889.jpg" width="500"></center>

Techniques for handling outliers:

* Removing outliers
* Transforming outliers 
* Imputing outliers with a reasonable value

__Consider the Following__

A dataset of customer ages contains some extremely high values. 

In [20]:
## Begin Example
# Create a sample DataFrame with outliers
age_data = {'Customer_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'Age': [25, 30, 28, 35, 22, 80, 32, 27, 90, 29]}
age_df = pd.DataFrame(age_data)
age_df
## End Example

Unnamed: 0,Customer_ID,Age
0,1,25
1,2,30
2,3,28
3,4,35
4,5,22
5,6,80
6,7,32
7,8,27
8,9,90
9,10,29


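To find them, we can compute a z-score for each age, which measures how many standard deviations a value lies from the mean. Values with a large positive or negative z-score are candidates for removal.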

In [22]:
## Begin Example
# Identify outliers using z-score
from scipy.stats import zscore
age_df['Z-Score'] = zscore(age_df['Age'])
age_df
## End Example

Unnamed: 0,Customer_ID,Age,Z-Score
0,1,25,-0.6446
1,2,30,-0.42683
2,3,28,-0.513938
3,4,35,-0.20906
4,5,22,-0.775263
5,6,80,1.750874
6,7,32,-0.339722
7,8,27,-0.557492
8,9,90,2.186414
9,10,29,-0.470384


In [23]:
## Begin Example
# Remove outliers with a z-score greater than 2 or less than -2
cleaned_age_df = age_df[(age_df['Z-Score'] < 2) & (age_df['Z-Score'] > -2)]
print(cleaned_age_df)
## End Example

   Customer_ID  Age   Z-Score
0            1   25 -0.644600
1            2   30 -0.426830
2            3   28 -0.513938
3            4   35 -0.209060
4            5   22 -0.775263
5            6   80  1.750874
6            7   32 -0.339722
7            8   27 -0.557492
9           10   29 -0.470384
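
With a threshold of 2, only the age of 90 is removed; the age of 80 (z-score of about 1.75) stays, because the two extreme values inflate the standard deviation that the z-score relies on. The sketch below shows an alternative detection rule, the interquartile range (IQR) with the conventional 1.5 multiplier, and applies the other two techniques listed earlier: transforming (here, capping values at the fence) and imputing (here, replacing outliers with the median of the remaining ages). These specific choices are assumptions for illustration, not the only reasonable ones.

```python
import pandas as pd

# Same ages as above, stored as floats so replacement values fit the dtype
ages = pd.Series([25, 30, 28, 35, 22, 80, 32, 27, 90, 29],
                 dtype='float64', name='Age')

# IQR fences: values beyond 1.5 * IQR from the quartiles count as outliers
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)

# Transform: cap extreme values at the fences instead of dropping them
capped_ages = ages.clip(lower=lower, upper=upper)

# Impute: replace outliers with the median of the non-outlier ages
imputed_ages = ages.mask(is_outlier, ages[~is_outlier].median())

print(ages[is_outlier])
print(capped_ages)
print(imputed_ages)
```

On this data the IQR rule flags both 80 and 90. Whether such ages are errors or real customers is a judgment call that depends on the dataset.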


### Correcting Inconsistencies
<center><img src="../images/stock/pexels-uriel-marques-1847899-3497522.jpg" width="500"></center>

Inconsistent data records the same thing in different ways. Two common fixes are:

* Standardizing text data (e.g., converting to lowercase, removing extra spaces)

* Correcting data type errors (e.g., converting strings to numeric or date format)
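
As a minimal sketch of the string-to-numeric case, suppose a hypothetical `Quantity` column was read in as text, with stray spaces and an `'N/A'` marker (these values are made up for illustration). `pd.to_numeric()` with `errors='coerce'` converts what it can and turns the rest into `NaN`.

```python
import pandas as pd

# Hypothetical column where numbers were stored as text
raw_quantity = pd.Series(['10', ' 25 ', 'N/A', '7.5'], name='Quantity')

# Strip surrounding spaces, then convert; unparseable entries become NaN
quantity = pd.to_numeric(raw_quantity.str.strip(), errors='coerce')

print(quantity)
print(quantity.dtype)
```

The resulting `NaN` values can then be handled with the missing-value techniques from earlier in this lesson.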

__Consider the Following__

A dataset of product names contains inconsistencies such as different capitalization and extra spaces.

In [34]:
## Begin Example
# Create a sample DataFrame with inconsistent text data
product_data = {'Product_ID': [1, 2, 3, 4, 5],
                'Product_Name': ['Product A', ' product b ', 'Product C', 'PRODUCT A ', 'Product D'],
                'Review_Date': ['2024-01-15', '2024-02-20', '2024-03-25', '2024-04-30', 'May 05, 2024']}
product_df = pd.DataFrame(product_data)
product_df
## End Example

Unnamed: 0,Product_ID,Product_Name,Review_Date
0,1,Product A,2024-01-15
1,2,product b,2024-02-20
2,3,Product C,2024-03-25
3,4,PRODUCT A,2024-04-30
4,5,Product D,"May 05, 2024"


Using `str.lower()` and `str.strip()`, we can standardize the names.

In [27]:
## Begin Example
# Standardize product names
product_df['Product_Name'] = product_df['Product_Name'].str.lower().str.strip()
product_df
## End Example

Unnamed: 0,Product_ID,Product_Name,Review_Date
0,1,product a,2024-01-15
1,2,product b,2024-02-20
2,3,product c,2024-03-25
3,4,product a,2024-04-30
4,5,product d,"May 05, 2024"


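Also, the `Review_Date` column stores every date as a string, and not all of them use the same format.  
We can convert them to `datetime` values using `pd.to_datetime()`. With `errors='coerce'`, any string that pandas cannot parse becomes `NaT` (Not a Time) instead of raising an error.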

In [28]:
## Begin Example
# Correct data type errors
product_df['Review_Date'] = pd.to_datetime(product_df['Review_Date'], errors='coerce')
print(product_df)
## End Example

   Product_ID Product_Name Review_Date
0           1    product a  2024-01-15
1           2    product b  2024-02-20
2           3    product c  2024-03-25
3           4    product a  2024-04-30
4           5    product d         NaT
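
Notice that row 4 (`'May 05, 2024'`) became `NaT` even though it is a valid date. This appears to be because recent pandas versions infer one format from the first value and apply it to the whole column, so `errors='coerce'` silently discards the odd one out. One possible workaround, sketched below, is to parse each string on its own with `apply`; pandas 2.0 and later also accept `format='mixed'` in `pd.to_datetime()` for the same purpose.

```python
import pandas as pd

product_df = pd.DataFrame({
    'Product_ID': [1, 2, 3, 4, 5],
    'Review_Date': ['2024-01-15', '2024-02-20', '2024-03-25',
                    '2024-04-30', 'May 05, 2024'],
})

# Parse every value individually so each string gets its own format guess
parsed_dates = product_df['Review_Date'].apply(pd.to_datetime)

# Check that nothing was lost before overwriting the column
print(parsed_dates.isna().sum())
product_df['Review_Date'] = parsed_dates
print(product_df)
```

Element-by-element parsing is slower on large columns, so it is best reserved for columns known to mix formats.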
