# Cleaning Data in Python

- Diagnose dirty data
- Clean data

## Data Type Constraints
- Make sure each column has the right data type before working with it
- Most of the work here is type conversion

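1. To convert a string column to `int`, first clean it with the `.str` accessor, then cast it:

   `df['col'] = df['col'].str.strip()` (or `.str.replace()`, depending on the characters to remove)

   `df['col'] = df['col'].astype('int')`

   **If** you want to verify that the column now has a specific data type (`int`), you can do:

   `assert df['col'].dtype == 'int'`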

2. To treat an integer column as categorical, we still use `.astype()`, this time with `'category'`
   - We need this because `.describe()` on an integer column computes numeric summary statistics (mean, quartiles) that are meaningless for category codes
```python
df['col_int'] = df['col_int'].astype('category')
```
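
A minimal sketch of both conversions together, assuming a made-up DataFrame with a text `revenue` column and an integer-coded `rating` column (both names are hypothetical):

```python
import pandas as pd

# Hypothetical data: revenue stored as text, rating stored as integer codes
df = pd.DataFrame({"revenue": ["$100", "$250", "$75"], "rating": [1, 2, 3]})

# Strip the currency symbol, then cast to a 64-bit integer
df["revenue"] = df["revenue"].str.strip("$").astype("int64")
assert df["revenue"].dtype == "int64"

# Treat the rating codes as categories so describe() reports counts, not means
df["rating"] = df["rating"].astype("category")
print(df["rating"].describe())
```

Using `'int64'` explicitly instead of `'int'` avoids surprises on platforms where the default integer width differs.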

## Data Range Constraints

To deal with out-of-range data, we can:
- drop the rows
- set custom minimums and maximums
- treat the values as missing and impute them
- set a custom value based on business assumptions

A handy way to assign a value to out-of-range entries is

`df.loc[df['col'] > 5, 'col'] = 5`

1. Dropping them:
```python
df.drop(df[condition].index, inplace = True)
```
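
The options above can be compared side by side. This is only a sketch, assuming a hypothetical `rating` column whose valid maximum is 5 and a median imputation (the choice of statistic is an assumption, not a rule):

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({"rating": [1, 3, 5, 7, 4]})  # 7 is out of range

# Option 1: drop the out-of-range rows
dropped = ratings[ratings["rating"] <= 5]

# Option 2: cap at the custom maximum
capped = ratings.copy()
capped.loc[capped["rating"] > 5, "rating"] = 5

# Option 3: treat as missing, then impute (median here, as an example)
as_missing = ratings.copy()
as_missing["rating"] = as_missing["rating"].astype("float64")  # NaN needs a float column
as_missing.loc[as_missing["rating"] > 5, "rating"] = np.nan
imputed = as_missing["rating"].fillna(as_missing["rating"].median())
```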

### Working with Date Ranges
- To check date ranges, get today's date and flag anything that falls after it
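
A small sketch of that check, assuming a hypothetical `signup_date` column; replacing future dates with today's date is just one possible treatment:

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({"signup_date": pd.to_datetime(["2020-01-15", "2100-01-01"])})

# Today's date as a Timestamp so it compares cleanly with a datetime column
today = pd.Timestamp(dt.date.today())

# Boolean mask of rows whose date lies in the future
in_future = df["signup_date"] > today

# One possible treatment: cap future dates at today
df.loc[in_future, "signup_date"] = today
```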

## Uniqueness Constraints

The `df.duplicated()` method in Pandas is used to identify duplicate rows within a DataFrame.

**Important Consideration:**

By default, `df.duplicated()` marks all occurrences *after* the first one as `True`. This means the *first* instance of a duplicate set will be marked as `False`, which can sometimes be misleading if you want to see *all* parts of a duplicate set.

**Enhancing Duplicate Detection with Arguments:**

To achieve more precise control over duplicate identification, utilize the following arguments:

* **`subset`**: This argument accepts a list of column names. When provided, the duplication check will only be performed on the specified columns. This is useful when you want to define a duplicate based on a subset of your data rather than the entire row.

* **`keep`**: This argument controls which duplicate observations (if any) are marked as `False` (not a duplicate) or `True` (a duplicate). It accepts three values:
    * `'first'` (default): Marks all duplicates as `True` except for the first occurrence.
    * `'last'`: Marks all duplicates as `True` except for the last occurrence.
    * `False`: Marks *all* occurrences of duplicate rows as `True`, including the first and last. This is often the preferred setting when you want to identify every single row that is part of a duplicate set.
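
To see the difference between the default and `keep=False`, here is a short sketch on made-up data (the column names are hypothetical):

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cy"],
    "address": ["1 Main St", "2 Oak Ave", "1 Main St", "3 Elm Rd"],
    "height": [160, 175, 162, 180],
})

# Default keep='first': only the later occurrence is flagged
default_flags = people.duplicated(subset=["name", "address"])

# keep=False: every row that belongs to a duplicate set is flagged
all_flags = people.duplicated(subset=["name", "address"], keep=False)

# Sorting makes the duplicate sets easy to inspect side by side
dupes = people[all_flags].sort_values("name")
print(dupes)
```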

#### Treating Duplicates
There are two types of duplicates and each one requires a specific strategy:

1. Complete duplicates: All columns are duplicated
2. Duplicates with discrepancy: Only some columns are duplicated

To deal with the **first one**:
- Only keep one of them
- You can use `.drop_duplicates()`
  - It takes the same `subset` and `keep` arguments as `.duplicated()`

```python
# keep='first' is the default, so only the later occurrences are dropped
df.drop_duplicates(inplace = True)
```

To deal with the **second one**:
- For example, if the only discrepancy is in the height, you can combine the conflicting values with a summary statistic chosen from your understanding of the field (mean, max, etc.)
- `.groupby()` together with `.agg()` does this

For example, if you are dealing with people's height and weight, and some rows share the same `name` and `address` but differ in `height` and `weight`, you can do:

```python
# columns that identify a duplicate
cols = ['name', 'address']

# summary statistic to apply to each column that disagrees
summaries = {'height': 'max', 'weight': 'mean'}

df = df.groupby(by = cols).agg(summaries).reset_index()
```

Basically, the rows are grouped by `name` and `address`. Groups with a single row come out unchanged by the summary statistics; only the groups that contain duplicates are collapsed into one row.

## II. Membership Constraints (Categorical Data)

### Range Inconsistency
How do you find inconsistent data (values outside the allowed categories)?

**Workflow:**

0. First check whether there are inconsistencies using `.unique()`

1. Take the set difference between the column being checked and the complete list of valid categories

```python
# values that appear in the data (set A) but not in the valid categories (set B)
inconsistent_categories = set(current_df['cat_col']).difference(cat_df['cat_col_complete'])
```
2. Now find the rows that contain any element of the resulting set, using `.isin()`, and subset the current DataFrame

```python
inconsistent_rows = current_df['cat_col'].isin(inconsistent_categories)
current_df[inconsistent_rows]
```
3. Work on it
   - You can drop it by simply using `~`: `current_df[~inconsistent_rows]`
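
The workflow above, end to end, on made-up data. The `blood_type` column and both DataFrames are hypothetical stand-ins for `current_df` and `cat_df`:

```python
import pandas as pd

# Reference table with every valid category
categories = pd.DataFrame(
    {"blood_type": ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"]}
)

# Data to check; "Z+" is not a valid blood type
study = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cy"], "blood_type": ["A+", "Z+", "O-"]}
)

# Step 1: values present in the data but missing from the reference
inconsistent_categories = set(study["blood_type"]).difference(categories["blood_type"])

# Step 2: rows that hold one of those values
inconsistent_rows = study["blood_type"].isin(inconsistent_categories)

# Step 3: keep only the consistent rows
clean = study[~inconsistent_rows]
```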

### Collapsing Too Many Categories into Few

For example, if we are working on incomes and want to group them into ranges, we can use `pd.cut()` (by default each bin is right-inclusive, `(x, y]`):

**Ranges:**
```python
ranges = [0, 200000, 500000, np.inf]

group_names = ['0-200k', '200k-500k', '500k+']

df['new_cat_col'] = pd.cut(df['income'], bins = ranges, labels = group_names)
```
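
One consequence of the `(x, y]` rule is worth checking: a value equal to the lowest edge falls outside every bin. A small sketch on made-up incomes, where `include_lowest=True` is shown as one way to include that edge:

```python
import numpy as np
import pandas as pd

incomes = pd.Series([0, 200000, 200001, 750000])
ranges = [0, 200000, 500000, np.inf]
group_names = ["0-200k", "200k-500k", "500k+"]

# Default: bins are (0, 200000], (200000, 500000], (500000, inf]
groups = pd.cut(incomes, bins=ranges, labels=group_names)

# include_lowest=True makes the first bin include its left edge as well
with_zero = pd.cut(incomes, bins=ranges, labels=group_names, include_lowest=True)

print(pd.DataFrame({"income": incomes, "default": groups, "include_lowest": with_zero}))
```

With the default settings an income of exactly 0 becomes `NaN`, while 200000 lands in `'0-200k'`.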

Sometimes we want fewer categories. For example, if we have the categories `['Microsoft', 'MacOS', 'IOS' ...]` and want just `['DesktopOS', 'MobileOS']`

To do that we can use `.replace()` with a mapping

```python
mapping = {'MicrosoftOS': 'DesktopOS', 'Linux': 'DesktopOS', 'IOS': 'MobileOS'}  # ... and so on for the remaining categories
df['os'] = df['os'].replace(mapping)
```

### Cleaning Text Data

Most of this comes down to the `.str` methods and regular expressions.

For example, if you have a column with phone numbers `['+63976-202-5431', ..., '45643']`

and you want the format `09762025431`, discarding any number shorter than 11 digits, you can use `.str.replace()`

```python
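# regex=False treats '+63' as literal text ('+' is a regex quantifier)
df['phone_numbers'] = df['phone_numbers'].str.replace('+63', '0', regex=False).str.replace('-', '', regex=False)

# numbers that are too short are treated as missing
df.loc[df['phone_numbers'].str.len() < 11, 'phone_numbers'] = np.nan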
```
**Assertion with text data:**

You can perform assertions as well, like this:

```python
sanity_check = df['phone_numbers'].str.len()

assert sanity_check.min() >= 11

# '+' must be escaped because str.contains() treats the pattern as a regex
assert not df['phone_numbers'].str.contains(r"\+63|-", na=False).any()
```

## III. Uniformity

Uniformity is all about units and formats, for example temperatures in Celsius vs. Fahrenheit, or dates in different formats.

So whenever you work with a column, consider whether its values could be in different units or formats.

**Dates:**

Just remember that when working with dates, you can always use `pd.to_datetime()`
- It recognizes most common date formats and returns a uniform `datetime64` column

You can do this via:

```python
df['date_cols'] = pd.to_datetime(df['date_cols'])
```

Now, this may raise an error on values it cannot parse. To handle that, pass the `errors` argument:

```python
df['date_cols'] = pd.to_datetime(df['date_cols'], errors = 'coerce')
# values that cannot be parsed become NaT (missing)
```

**Reformatting the dates:**

You can use `.dt.strftime()` on a Series of dates, passing the desired format string.
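
For instance, a short sketch that reformats made-up dates as day-month-year (the format string is only an example):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-05", "2023-12-25"]))

# %d = day, %m = month, %Y = four-digit year; the result is a Series of strings
formatted = dates.dt.strftime("%d-%m-%Y")
print(formatted.tolist())
```

Note that the result holds strings, not datetimes, so do the reformatting last, after any date arithmetic.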

### Cross Field Validation

Using **multiple** fields to sanity check data integrity

One good example is when you have the columns `age` and `birthday`

You can sanity check the age by subtracting the birth date from today's date

```python
import datetime as dt

df['birthday'] = pd.to_datetime(df['birthday'])
today = dt.date.today()

# year difference only; ignores whether this year's birthday has passed
age_manual = today.year - df['birthday'].dt.year

age_equ = age_manual == df['age']

inconsistent_age = df[~age_equ]

consistent_age = df[age_equ]
```
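
The year difference above can be off by one for people whose birthday has not yet occurred this year. A possible refinement, as a sketch only (`age_on` is a hypothetical helper, and a fixed reference date is used so the result is reproducible):

```python
import datetime as dt
import pandas as pd

def age_on(birthdays: pd.Series, on: dt.date) -> pd.Series:
    """Age in whole years on the date `on`."""
    # True where the birthday has already happened in the reference year
    had_birthday = (birthdays.dt.month < on.month) | (
        (birthdays.dt.month == on.month) & (birthdays.dt.day <= on.day)
    )
    # Subtract one year where the birthday is still to come
    return on.year - birthdays.dt.year - (~had_birthday).astype(int)

birthdays = pd.Series(pd.to_datetime(["1990-06-15", "1990-06-16"]))
ages = age_on(birthdays, dt.date(2024, 6, 15))
print(ages.tolist())
```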

### Completeness

#### Checking For Missing Values
- We chain: `.isna()` and `.sum()`

Example:
`df.isna().sum()`
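
A quick sketch on made-up data showing what the chain returns: one count of missing values per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": ["x", None, None]})

# isna() gives a boolean DataFrame; sum() counts the True values per column
missing = df.isna().sum()
print(missing)

# Rows with at least one missing value, for closer inspection
rows_with_missing = df[df.isna().any(axis=1)]
```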

## IV. Record Linkage

### Comparing strings

**Minimum Edit Distance** is a way to measure how different two strings are. It tells us the minimum number of operations needed to convert one string into another.

The allowed operations are usually:
- **Insertion**: Add a character.
- **Deletion**: Remove a character.
- **Substitution**: Replace one character with another.

#### Example:
Convert `"kitten"` to `"sitting"`:

1. kitten → sitten (substitute 'k' with 's')  
2. sitten → sittin (substitute 'e' with 'i')  
3. sittin → sitting (insert 'g')

🟰 Minimum edit distance = **3**

The lower the edit distance, the more similar the two strings are, which suggests the words may be related.

There are different algorithms for computing this, suited to different scenarios.
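
To make the idea concrete, here is a plain-Python sketch of one such algorithm, the Levenshtein distance, which counts insertions, deletions, and substitutions. The function name `edit_distance` is made up for this note; libraries such as `thefuzz` use their own optimized implementations and scoring.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    # previous[j] = distance between the processed prefix of a and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))
```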


### Simple String Comparison
We can use `thefuzz` to compare two strings. The output is a score from 0 to 100, with 100 being the most similar.

```python
from thefuzz import fuzz

fuzz.WRatio('Reeding', 'Reading')
```

### Comparison with Arrays

You can compare a string to an array of strings using `thefuzz`

```python
from thefuzz import process

string = 'Housten Rockets vs Los Angeles Lakers'

choices = pd.Series([..., ..., ...])

process.extract(string, choices, limit = n)
```

This returns a list of tuples, each of the form `(matching string, score, index)`

### Collapsing Categories with String Similarity

Now, there will be times when a column contains several variants of the same value (perhaps typos). We can use string similarity to collapse those variants into a single category.

The thought process is:

1. Get the unique values of the column with typos
2. Inspect the scores returned by `process.extract('', column)` to find the lowest score that still counts as a match
3. Use that lowest score as the cutoff when transforming all the typos

### Sample Code:
```python
from thefuzz import process
import pandas as pd

# 1. Get the unique values of the column with typos
typos = [
    "Italy", "Iatily", "Ityaly", "italy",
    "USA", "Usa", "us", "u.s.a",
    "Germany", "germany", "Ger", "germ"
]

# The correct categories to match against
correct_categories = ["Italy", "USA", "Germany"]

# Create a DataFrame for demonstration
df = pd.DataFrame({"country_typos": typos})
unique_typos = df["country_typos"].unique()

# 2. Investigate the lowest score needed for the match
# We'll use a threshold of 75 as an example. You would determine this
# by manually checking the scores of various matches.
threshold = 75

def collapse_category(typo_value):
    """
    Finds the best match for a typo value against the correct categories.
    """
    # process.extractOne returns only the best match as a tuple: (matched_string, score)
    best_match = process.extractOne(typo_value, correct_categories)

    # If the score is at or above the threshold, return the best match.
    if best_match and best_match[1] >= threshold:
        return best_match[0]
    else:
        # Below the threshold: keep the original value
        return typo_value

# 3. Transform all the typos
df["cleaned_country"] = df["country_typos"].apply(collapse_category)

print(df)
```

### Record Linkage

There are more detailed notes about this in a separate markdown file.

It is the act of linking data from different sources regarding the same entity.

To do this:
- We generally clean two or more DataFrames
- Generate pairs of potentially matching records
- Score these pairs according to string similarity metrics
- Link them

For example, if you have two census DataFrames taken from different sources, you can't simply merge them because the same person may appear in both, which would create duplicates.

We want to generate pairs of records across the two DataFrames (kind of like a Cartesian product). However, we cannot simply pair everything, because the number of combinations grows with the product of the two row counts.

What we can do instead is only pair records that agree on a matching column, such as state (this is called blocking).

```python
import recordlinkage

# object used to generate candidate pairs from the DataFrames
indexer = recordlinkage.Index()

# Generate pairs blocked on state
indexer.block('state')

pairs = indexer.index(df1, df2)
```

The output is a pandas `MultiIndex` of possible pairs of row indices

Once you have found _possible_ pairs, it is time to compare them to see if they **really** match
```python
# generate pairs
pairs = indexer.index(df1, df2)

# create a comparison object
compare_obj = recordlinkage.Compare()

# find exact matches for pairs of col 1 and col 2
compare_obj.exact('col1', 'col1', label = 'col1')
compare_obj.exact('col2', 'col2', label = 'col2')
# these are the columns that you know are exact

# find similar matches for pairs of other columns that may not be exact
compare_obj.string('col_x', 'col_x', threshold = 0.85, label = 'col_x')
compare_obj.string('col_y', 'col_y', threshold = 0.85, label = 'col_y')

# find matches
potential_matches = compare_obj.compute(pairs, df1, df2)
```

The result is a DataFrame with a multi-index: the first level is the row index from `df1` and the second level is the row index of the paired record in `df2`.

To find the probable matches, we simply sum each row and keep the rows whose sum meets our threshold.
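
A sketch of that last step. The `potential_matches` table below is built by hand as a stand-in for the output of `compare_obj.compute()`, and the cutoff of 3 matching columns is an assumption chosen for illustration:

```python
import pandas as pd

# Hand-made stand-in: one row per candidate pair, one 0/1 score per compared column
index = pd.MultiIndex.from_tuples(
    [(0, 10), (0, 11), (1, 12)], names=["df1_idx", "df2_idx"]
)
potential_matches = pd.DataFrame(
    {
        "col1": [1, 1, 0],
        "col2": [1, 0, 0],
        "col_x": [1.0, 1.0, 0.0],
        "col_y": [1.0, 0.0, 1.0],
    },
    index=index,
)

# Assumed cutoff: a pair counts as a match if at least 3 columns agree
min_matching_columns = 3
matches = potential_matches[potential_matches.sum(axis=1) >= min_matching_columns]

# Row labels in df2 that duplicate a record already in df1
duplicate_rows_in_df2 = matches.index.get_level_values("df2_idx")
print(list(duplicate_rows_in_df2))
```

From here, one way to link the two sources is to drop those rows from `df2` and append what remains to `df1`.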