In [1]:
## Begin Setup
import pandas as pd
import numpy as np
## End Setup

# Intro to Pandas Data Cleaning 

<center><img src="../images/stock/pexels-pixabay-434163.jpg" width="500"></center>

## Introduction
Why is data cleaning important?

* Real-world data is often messy, with errors, inconsistencies, and missing values.

* Cleaning ensures data quality, leading to more reliable analysis and accurate conclusions.

## Overview of common data cleaning steps:

* __Handling Missing Values__: Sometimes data has empty spots. We need to decide what to do with them.

* __Removing Irrelevant Data__: Not all the information we have will be useful. We want to get rid of things like duplicate entries or columns that don't tell us anything new. 

* __Filtering Data__: Often, we only want to look at a specific part of our data. Filtering lets us select just the rows that meet certain conditions.

* __Handling Outliers__: Outliers are data points that are very different from the rest. They can sometimes be real, but other times they're errors. 

* __Correcting Inconsistencies__: Data can be recorded in different ways even when it means the same thing (like "USA" vs. "United States"). We need to make sure these things are consistent so our analysis is accurate.

## Dealing With Missing Values

<center><img src="../images/stock/pexels-matilda-wormwood-4099467.jpg" width="500"></center>

Sometimes, our data has empty spots, which Pandas represents as `NaN` (Not a Number). We need to deal with these so they don't mess up our analysis.

### Identifying Missing Values

* We can use `df.isna()` to create a table where `True` indicates a missing value and `False` means the value is present.

* Conversely, `df.notna()` gives us a table where `True` means the value is present and `False` indicates a missing value.
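
As a minimal sketch of how these two methods are typically used, the example below counts missing values per column and pulls out the rows that contain any gap. The DataFrame and its column names (`name`, `score`) are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Dee"],
    "score": [90.0, np.nan, 75.0, np.nan],
})

# isna() returns True/False per cell; summing counts the True values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# any(axis=1) is True for a row if at least one of its cells is missing
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)

# notna() is the mirror image: count the values that are present
present_per_column = df.notna().sum()
print(present_per_column)
```

Summing the boolean table is often more useful than reading it cell by cell, especially once the dataset has more than a handful of rows.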

### Handling Missing Data

There are a few common ways to handle missing values:

* __Removing Missing Data:__

    * We can remove entire rows that have missing values using `df.dropna()`. Be careful, as this can reduce the size of your dataset.
    * We can also remove entire columns with missing values by using `df.dropna(axis=1)`. This is useful if a whole feature has a lot of missing information.


* __Filling Missing Data:__

    * Instead of removing data, we can try to fill in the gaps with estimated values using `df.fillna()`:

        * Fill with a specific value: `df.fillna(0)` would replace all `NaN` values with 0. You can use any value that makes sense for your data.
        * Fill with the mean: `df.fillna(df.mean(numeric_only=True))` replaces missing numerical values with the average of that column. The `numeric_only=True` argument skips text columns, which have no mean.
        * Fill with the median: `df.fillna(df.median(numeric_only=True))` replaces missing numerical values with the middle value, which is less affected by extreme values.

By using these Pandas functions, we can effectively identify and handle missing data in our datasets.
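
The options above can be compared side by side on a small table. This is a sketch under stated assumptions: the `height` and `weight` columns and their values are made up for the demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one gap in each column
df = pd.DataFrame({
    "height": [150.0, np.nan, 170.0, 180.0],
    "weight": [50.0, 60.0, np.nan, 90.0],
})

# Removing: keep only rows (or only columns) with no missing values
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)

# Filling: replace gaps with a constant, the column mean, or the column median
filled_zero = df.fillna(0)
filled_mean = df.fillna(df.mean(numeric_only=True))
filled_median = df.fillna(df.median(numeric_only=True))

print(rows_dropped)
print(cols_dropped.shape)
print(filled_median)
```

Note that `dropna(axis=1)` removes every column here, because each column has at least one gap. That is why it is worth checking how much data survives before committing to a removal strategy.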

### Example: Missing Values 

__Consider the Following:__

A dataset of customer information contains missing values in the `Email`, `Phone Number`, and `Address` columns.

In [5]:
### Begin Example
# Create a sample DataFrame
data = {'Customer_ID': [i for i in range(1,6)],
        'Email': ['john.doe@example.com', 
                  np.nan, 
                  'jane.smith@example.com', 
                  'alice.johnson@example.com', 
                  np.nan
                 ],
        'Phone Number': ['123-456-7890', 
                         '987-654-3210', 
                         np.nan, 
                         '555-123-4567', 
                         np.nan
                        ],
        'Address': ['123 Main St', 
                    '456 Oak Ave', 
                    '789 Pine Ln', 
                    np.nan, 
                    '246 Elm St'
                   ]}

customer_df = pd.DataFrame(data)

customer_df
### End Example

Unnamed: 0,Customer_ID,Email,Phone Number,Address
0,1,john.doe@example.com,123-456-7890,123 Main St
1,2,,987-654-3210,456 Oak Ave
2,3,jane.smith@example.com,,789 Pine Ln
3,4,alice.johnson@example.com,555-123-4567,
4,5,,,246 Elm St


We can use `df.isna()` to locate these missing values.

In [6]:
### Begin Example

# Locate Missing Values
customer_df.isna()
### End Example

Unnamed: 0,Customer_ID,Email,Phone Number,Address
0,False,False,False,False
1,False,True,False,False
2,False,False,True,False
3,False,False,False,True
4,False,True,True,False


If many rows have missing 'Email' or 'Phone Number', we might drop those rows using `df.dropna()`.

By default, it drops a row if any of its columns have a missing value. The `subset` argument allows you to specify which columns should be considered when identifying missing values for dropping rows. If a missing value is found in any of the columns listed in `subset` for a given row, that row will be dropped.
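
To make the difference between the default behavior and `subset` concrete, here is a small sketch. The `contacts` DataFrame is a hypothetical stand-in with the same column names as the customer data above, not the lesson's own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical contact data: each of rows 1, 2, and 3 has one gap
contacts = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4],
    "Email": ["a@example.com", np.nan, "c@example.com", "d@example.com"],
    "Phone Number": ["111", "222", np.nan, "444"],
    "Address": ["1 Main St", "2 Oak Ave", "3 Pine Ln", np.nan],
})

# Default: a gap in ANY column drops the row
default_drop = contacts.dropna()

# subset: only gaps in 'Email' or 'Phone Number' cause a row to be dropped
subset_drop = contacts.dropna(subset=["Email", "Phone Number"])

print(default_drop["Customer_ID"].tolist())
print(subset_drop["Customer_ID"].tolist())
```

Customer 4 is missing only an `Address`, so the row survives when `subset` is used but is removed by the default call.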

In [8]:
### Begin Example
# Drop rows where 'Email' or 'Phone Number' is missing

# Output Results


### End Example

Unnamed: 0,Customer_ID,Email,Phone Number,Address
0,1,john.doe@example.com,123-456-7890,123 Main St


If only a few values are missing, we might instead fill them with a placeholder using `fillna()`.

* For `Address`, we might use `Unknown`.
* For `Email` and `Phone Number` we might use `""`
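
One way to apply a different placeholder to each column is to pass `fillna()` a dictionary that maps column names to fill values. The sketch below uses a hypothetical `customers` DataFrame with the same column names as the example.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with one gap in each text column
customers = pd.DataFrame({
    "Customer_ID": [1, 2, 3],
    "Email": ["a@example.com", np.nan, "c@example.com"],
    "Phone Number": [np.nan, "222", "333"],
    "Address": ["1 Main St", "2 Oak Ave", np.nan],
})

# Keys are column names, values are the placeholder for that column
placeholders = {"Address": "Unknown", "Email": "", "Phone Number": ""}
filled = customers.fillna(placeholders)

print(filled)
```

`fillna()` returns a new DataFrame, so `customers` keeps its gaps unless the result is assigned back to it.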

In [12]:
### Begin Example
# Re-Create the sample DataFrame
data = {'Customer_ID': [i for i in range(1,6)],
        'Email': ['john.doe@example.com', 
                  np.nan, 
                  'jane.smith@example.com', 
                  'alice.johnson@example.com', 
                  np.nan
                 ],
        'Phone Number': ['123-456-7890', 
                         '987-654-3210', 
                         np.nan, 
                         '555-123-4567', 
                         np.nan
                        ],
        'Address': ['123 Main St', 
                    '456 Oak Ave', 
                    '789 Pine Ln', 
                    np.nan, 
                    '246 Elm St'
                   ]}

customer_df = pd.DataFrame(data)

# Fill missing 'Address' values with 'Unknown'





### End Example

Unnamed: 0,Customer_ID,Email,Phone Number,Address
0,1,john.doe@example.com,123-456-7890,123 Main St
1,2,,987-654-3210,456 Oak Ave
2,3,jane.smith@example.com,,789 Pine Ln
3,4,alice.johnson@example.com,555-123-4567,
4,5,,,246 Elm St


## Removing Irrelevant Data

<center><img src="../images/stock/pexels-cottonbro-4108715.jpg" width="500"></center>

Sometimes, our datasets contain information that doesn't help us with our analysis. Getting rid of this "noise" makes our data cleaner and easier to work with.

### Identifying and Removing Unnecessary Columns:

Often, datasets include columns that are unique identifiers (like `Customer_ID`) or provide information that isn't relevant to our questions.

For example, if we're analyzing customer purchase behavior, the unique `Customer_ID` might not tell us anything about what they buy.

To remove one or more columns, we use the `df.drop()` function. We need to specify the names of the columns we want to remove and set `axis=1` to indicate we're dropping columns (as opposed to rows).

__Syntax Example__

```python
# Example: Removing a single column
df = df.drop('Column1', axis=1)

# Example: Removing multiple columns
columns_to_drop = ['Column1', 'Column2']
df = df.drop(columns_to_drop, axis=1)
```
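
As an alternative spelling, `drop()` also accepts a `columns=` keyword, which removes the need for `axis=1`. The sketch below uses a hypothetical `orders` DataFrame to show that the original is left unchanged unless the result is assigned.

```python
import pandas as pd

# Hypothetical purchase data, for illustration only
orders = pd.DataFrame({
    "Customer_ID": [1, 2, 3],
    "Product": ["Pen", "Ink", "Pad"],
    "Quantity": [4, 1, 2],
})

# Equivalent to orders.drop(['Customer_ID'], axis=1)
trimmed = orders.drop(columns=["Customer_ID"])

print(trimmed.columns.tolist())
print(orders.columns.tolist())  # the original still has all three columns
```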

### Removing Duplicate Rows

Duplicate rows are identical entries in our dataset. They don't provide any new information and can skew our analysis. 

* We can find them using `df.duplicated()`. 

* This returns a Series of True or False values, where True indicates a row that is a duplicate of a previous row.

* To remove these duplicate rows, we use the `df.drop_duplicates()` function.
    * By default, it keeps the first occurrence of each unique row.

```python
# Example: Removing duplicate rows, keeping the first occurrence
df = df.drop_duplicates()

# Example: Removing duplicates, keeping the last occurrence
df = df.drop_duplicates(keep='last')

# Example: Removing duplicates based on specific columns
df = df.drop_duplicates(subset=['Column1', 'Column2'])
```
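
The sketch below runs these calls on a small hypothetical `visits` table so the effect of each option is visible. It also uses `keep=False`, which flags every copy of a duplicated row rather than only the later ones.

```python
import pandas as pd

# Hypothetical visit log: rows 0 and 2 are identical
visits = pd.DataFrame({
    "User_ID": [1, 2, 1, 3],
    "Page": ["Home", "Blog", "Home", "Home"],
})

# True marks a row that repeats an earlier row
dup_flags = visits.duplicated()

# keep=False marks all copies, which helps when inspecting duplicates
all_copies = visits[visits.duplicated(keep=False)]

# Drop the repeats and renumber the index from 0
deduped = visits.drop_duplicates().reset_index(drop=True)

# Treat rows as duplicates when they match on 'Page' alone
by_page = visits.drop_duplicates(subset=["Page"])

print(dup_flags.tolist())
print(deduped)
print(by_page)
```

`drop_duplicates()` keeps the original index labels, so a call to `reset_index(drop=True)` is a common follow-up when a gap-free index is wanted.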

### Example: Removing Duplicate Entries

__Consider the Following:__

A dataset of website traffic containing duplicate entries.  

In [13]:
## Begin Example
# Create a sample DataFrame with duplicate rows
traffic_data = {'User_ID': [101, 102, 101, 103, 102, 104],
                'Page_Visited': ['Home', 
                                 'Products', 
                                 'Home', 
                                 'Services', 
                                 'Products', 
                                 'Contact'
                                ],
                'Timestamp': ['2024-07-28 10:00:00', 
                              '2024-07-28 10:15:00', 
                              '2024-07-28 10:00:00', 
                              '2024-07-28 10:30:00', 
                              '2024-07-28 10:15:00', 
                              '2024-07-28 10:45:00'
                             ]}

traffic_df = pd.DataFrame(traffic_data)

traffic_df
## End Example

Unnamed: 0,User_ID,Page_Visited,Timestamp
0,101,Home,2024-07-28 10:00:00
1,102,Products,2024-07-28 10:15:00
2,101,Home,2024-07-28 10:00:00
3,103,Services,2024-07-28 10:30:00
4,102,Products,2024-07-28 10:15:00
5,104,Contact,2024-07-28 10:45:00


Let's identify the duplicates.

In [15]:
## Begin Example
# Identify duplicate rows


## End Example

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

We can remove these duplicate rows.

In [17]:
## Begin Example
# Remove duplicate rows



## End Example

Unnamed: 0,User_ID,Page_Visited,Timestamp
0,101,Home,2024-07-28 10:00:00
1,102,Products,2024-07-28 10:15:00
3,103,Services,2024-07-28 10:30:00
5,104,Contact,2024-07-28 10:45:00


## Filtering Data

<center><img src="../images/stock/pexels-cottonbro-4107284.jpg" width="500"></center>

__Filtering__ allows you to select specific rows from your data based on conditions, enabling targeted analysis. This is primarily achieved using __boolean indexing__.

### Boolean Indexing: The Core Technique

__Boolean indexing__ is the standard Pandas technique for selecting the rows of a DataFrame that satisfy a condition.

__Steps:__

1. Create a Boolean Mask: Define a condition that, when applied to a data structure (like a Pandas Series or DataFrame column), produces a Series of True and False values. `True` indicates rows that satisfy the condition.

__Syntax Example__

```python
mask = df['Column Name'] == 'value'  # any comparison works: ==, !=, >, <, >=, <=
```

2. Apply the Boolean Mask: Use the generated mask to index the original data structure. This returns a new structure containing only the rows where the mask is `True`.

__Syntax Example__
```python
filtered_df = df[mask]
```
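
The two steps can be sketched end to end as follows. The `orders` DataFrame is hypothetical. The example also shows how masks can be combined with `&` (and), `|` (or), and `~` (not), which goes slightly beyond the two steps above.

```python
import pandas as pd

# Hypothetical order data, for illustration only
orders = pd.DataFrame({
    "Product": ["A", "B", "A", "C"],
    "Quantity": [10, 5, 20, 15],
})

# Step 1: build masks (one True/False value per row)
is_a = orders["Product"] == "A"
is_large = orders["Quantity"] >= 15

# Step 2: index the DataFrame with a mask
a_rows = orders[is_a]

# Masks combine element-wise; use &, |, ~ rather than and, or, not
large_a = orders[is_a & is_large]
a_or_large = orders[is_a | is_large]
not_a = orders[~is_a]

print(a_rows)
print(large_a)
```

When conditions are written inline, each one needs its own parentheses, for example `orders[(orders["Product"] == "A") & (orders["Quantity"] >= 15)]`, because `&` binds more tightly than the comparison operators.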

### Example: Boolean Indexing

Consider a dataset of sales transactions that contains information about different products.

In [None]:
## Begin Example
# Create a sample DataFrame
sales_data = {'Transaction_ID': [1, 2, 3, 4, 5],
                'Product': ['Product A', 
                            'Product B', 
                            'Product A', 
                            'Product C', 
                            'Product A'
                           ],
                'Quantity': [10, 5, 20, 15, 30],
                'Price': [25.00, 50.00, 25.00, 75.00, 25.00]}

sales_df = pd.DataFrame(sales_data)

## End Example

Let's create a boolean mask to focus only on `Product A`.

In [None]:
## Begin Example

# Product A Mask

## End Example

Now let's apply the mask to create a filtered DataFrame.

In [None]:
## Begin Example

# Apply Mask

## End Example

## Handling Outliers

<center><img src="../images/stock/pexels-shvetsa-5217889.jpg" width="500"></center>

Outliers are data points that significantly deviate from the rest of your dataset. They can skew analysis and impact model performance.

__Common Techniques:__

* __Removing Outliers:__ Deleting the extreme data points. Use with caution, as you might lose valuable information.

* __Transforming Outliers:__ Applying mathematical functions (e.g., log, square root) to reduce the impact of extreme values.

* __Imputing Outliers:__ Replacing outliers with a more representative value (e.g., mean, median, a capped value).
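
Here is a rough sketch of the three techniques applied to one small Series. The ages and the cutoff of 70 are arbitrary values chosen for illustration, not a statistically justified threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 80 plays the role of the outlier
ages = pd.Series([25, 30, 28, 35, 22, 80], name="Age", dtype="float64")
upper_limit = 70  # arbitrary cutoff, an assumption for this demo

# Removing: keep only the values at or below the limit
removed = ages[ages <= upper_limit]

# Transforming: log1p compresses large values more than small ones
log_ages = np.log1p(ages)

# Imputing with a capped value: anything above the limit becomes the limit
capped = ages.clip(upper=upper_limit)

# Imputing with the median: keep values within the limit, replace the rest
median_age = ages.median()
imputed = ages.where(ages <= upper_limit, median_age)

print(removed.tolist())
print(capped.tolist())
print(imputed.tolist())
```

Which option fits depends on whether the extreme value is a recording error or a real observation. Removal discards the row entirely, while capping and imputing keep the row but alter the value.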

### Example: Handling Outliers

__Consider the Following__

A dataset of customer ages contains some extremely high values. 

In [18]:
## Begin Example
# Create a sample DataFrame with outliers
age_data = {'Customer_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'Age': [25, 30, 28, 35, 22, 80, 32, 27, 90, 29]}

age_df = pd.DataFrame(age_data)

age_df
## End Example

Unnamed: 0,Customer_ID,Age
0,1,25
1,2,30
2,3,28
3,4,35
4,5,22
5,6,80
6,7,32
7,8,27
8,9,90
9,10,29


For demonstration purposes, let's remove ages that are above an arbitrary limit of 70 (note: this is not statistically sound for real-world data).

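First we identify the outliers by building a mask that is `True` for ages at or below the limit, so the outliers are the rows marked `False`.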

In [19]:
## Begin Example

# Set Upper Limit


# Create Mask


## End Example

0     True
1     True
2     True
3     True
4     True
5    False
6     True
7     True
8    False
9     True
Name: Age, dtype: bool

Now let's get those outliers out of there.

In [20]:
## Begin Example
# Remove outliers


## End Example

Unnamed: 0,Customer_ID,Age
0,1,25
1,2,30
2,3,28
3,4,35
4,5,22
6,7,32
7,8,27
9,10,29


## Correcting Inconsistencies
<center><img src="../images/stock/pexels-uriel-marques-1847899-3497522.jpg" width="500"></center>

Data inconsistencies can arise from various sources and hinder accurate analysis. Identifying and correcting these issues is a crucial step in data preparation.

__Common Correction Techniques:__

* __Standardizing Text Data:__ Ensuring uniformity in text by:

    * Converting all text to the same case (e.g., lowercase or uppercase).
    * Removing leading, trailing, and extra internal spaces.
    * Handling variations in spelling or abbreviations.


* __Correcting Data Type Errors:__ Ensuring data is stored in the appropriate format for analysis:

    * Converting strings that represent numbers into numeric types (integers or floats).
    * Parsing strings representing dates into datetime objects.
    * Changing other data types as needed (e.g., booleans to integers).
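
The data type conversions listed above might look like the following sketch. The `raw` DataFrame and its columns are hypothetical, and `pd.to_numeric()` and `astype()` are one reasonable choice of tools rather than the only one.

```python
import pandas as pd

# Hypothetical data where numbers arrived as text
raw = pd.DataFrame({
    "price": ["10.5", "20", "n/a"],
    "in_stock": [True, False, True],
})

# Strings to numbers; values that cannot be parsed become NaN
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# Booleans to integers (True -> 1, False -> 0)
raw["in_stock"] = raw["in_stock"].astype(int)

print(raw)
print(raw.dtypes)
```

Checking `raw.dtypes` before and after a conversion is a quick way to confirm that each column ended up with the intended type.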

### Example: Inconsistent Data

__Consider the following:__

A dataset of product names containing inconsistencies such as different capitalization and extra spaces. 

In [None]:
## Begin Example
# Create a sample DataFrame with inconsistent text data
product_data = {'Product_ID': [i for i in range(5)],
                'Product_Name': ['Product A', ' product b ', 'Product C', 'PRODUCT A ', 'Product D'],
                'Review_Date': ['2024-01-15', '2024-02-20', '2024-03-25', '2024-04-30', 'May 05, 2024']}
product_df = pd.DataFrame(product_data)

product_df
## End Example

Using `str.lower()` and `str.strip()`, we can standardize the names.

* `str.lower()` - This function is used to convert all characters within each string in a Pandas Series to lowercase.

* `str.strip()` - This function is used to remove leading and trailing whitespace from each string in a Pandas Series. Whitespace includes spaces, tabs (`\t`), and newline characters (`\n`) at the beginning and end of a string.
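
A short sketch of both methods chained together, using a hypothetical Series of product names. The last step, collapsing repeated internal spaces with `str.replace()` and a regular expression, is an extra not covered by `str.strip()`.

```python
import pandas as pd

# Hypothetical names with inconsistent case and spacing
names = pd.Series([" Product A", "product  b ", "PRODUCT A "])

# Trim the ends, then lowercase everything
cleaned = names.str.strip().str.lower()

# Collapse runs of whitespace inside the string to a single space
collapsed = cleaned.str.replace(r"\s+", " ", regex=True)

print(cleaned.tolist())
print(collapsed.tolist())
print(collapsed.nunique())
```

After cleaning, the first and third entries become the same string, so counting unique values gives a truer picture of how many distinct products exist.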

In [None]:
## Begin Example

# Standardize product names

## End Example

Also, the `Review_Date` column stores its dates as strings, and one of them uses a different format from the rest.

We can convert them to `datetime` objects using `pd.to_datetime()`.

* `pd.to_datetime()` is a Pandas function that transforms various date and time formats, such as strings, into Pandas datetime objects. This is necessary for performing time-related operations on your data.

In [None]:
## Begin Example

# Correct data type errors

## End Example

What's this error?

Although `pd.to_datetime()` tries to infer the format automatically, it might not correctly parse a column with mixed formats like this one without explicit instructions.

The `errors='coerce'` argument tells Pandas that if it encounters any string it cannot understand as a date, it should replace it with NaT (Not a Time) instead of raising an error, making the data cleaning process more robust.
    
__Syntax__
```python
df['Column'] = pd.to_datetime(df['Column'], errors='coerce')
```
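
To see what `errors='coerce'` does in practice, the sketch below parses a hypothetical Series that contains one string that is not a date at all.

```python
import pandas as pd

# Hypothetical date strings; the last entry cannot be parsed
dates = pd.Series(["2024-01-15", "2024-02-20", "not a date"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(dates, errors="coerce")

print(parsed)

# Count how many values were lost in the conversion
print(parsed.isna().sum())
```

Depending on the pandas version, a valid date written in a different format from the rest may also be coerced to `NaT`, so it is worth checking how many values were lost after the conversion.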

Let's give it a try.

In [None]:
## Begin Example

## End Example

## Summary: Preparing Data for Analysis

<center><img src="../images/stock/pexels-steffi-wacker-1093179-6626084.jpg" width="500"></center>

In this lesson, we've covered essential techniques for preparing your data for meaningful analysis in Python.

__We explored:__

* __Handling Missing Values:__ Locating gaps with `isna()` and `notna()`, then removing them with `dropna()` or filling them with `fillna()`.
* __Removing Irrelevant Data:__ Dropping unneeded columns with `drop()` and duplicate rows with `drop_duplicates()`.
* __Filtering Data:__ Selecting specific subsets of your data based on conditions using boolean indexing. This allows you to focus on relevant information.
* __Handling Outliers:__ Identifying and addressing extreme values that can skew your analysis. We discussed removing, transforming, and imputing outliers, and demonstrated removal using an arbitrary upper limit.
* __Correcting Inconsistencies:__ Ensuring uniformity and accuracy in your data by standardizing text (case, whitespace) and converting data types (like strings to datetime objects using `pd.to_datetime()`).

__Key Takeaway:__

Clean and well-prepared data is the foundation of reliable and insightful analysis. By mastering these techniques, you can significantly improve the quality of your findings and build more robust data-driven solutions. Remember that the specific steps and techniques you apply will depend on the unique characteristics and challenges of your dataset.

In [None]:
# Demonstration


