## Week 8 Lecture `.ipynb` File

#### Author: Mahmoud Harding

## Data Cleaning and Data Moves

### `pandas`

`pandas` is an open source Python package widely used for data science, data analysis, and machine learning tasks. It is built on top of another library named `NumPy`, which provides support for arrays. Since we already know how to perform operations on `NumPy` arrays, we can apply the same kinds of operations to columns in a `pandas` `DataFrame`.

`pandas` is a fast, powerful, flexible, and (sometimes) easy-to-use data analysis and manipulation tool. For a quick reference, see the Data Wrangling with `pandas` [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

In [None]:
import pandas as pd

In [None]:
ss = ...

In [None]:
ss.info()

In [None]:
ss.head()

## Data Cleaning & Data Wrangling

Data cleaning and data wrangling are related but distinct processes in data preparation. Data cleaning focuses on correcting errors, handling missing values, removing duplicates, and ensuring data consistency. It ensures that the dataset is accurate and reliable. Data wrangling, on the other hand, is a broader process that includes data cleaning, but also involves restructuring, transforming, merging, and filtering data to make it suitable for analysis. While cleaning improves data quality, wrangling prepares data for specific tasks by reshaping it into a usable format.

**Example 1.** Drop `Unnamed: 0` column.

In [None]:
ss = ss.drop(columns = ['Unnamed: 0'])

**Example 2.** Rename columns to descriptive `snake_case` names, replacing periods and spaces with underscores.

In [None]:
ss = ss.rename(columns = {'status.completed' : 'year_completed',
                          'status.started' : 'year_started',
                          'height in meters' : 'height_meters'
                         }
              )

**Example 3.** Convert values in the `height_meters` column to numeric data type and create a new column named `clean_height_meters`.

In [None]:
ss['clean_height_meters'] = ss['height_meters'].str.replace(",", "").astype(float)
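The same conversion can be tried on a small toy column first. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration, and the real values in `ss` may look different.

```python
import pandas as pd

# Hypothetical toy data; not taken from the skyscraper dataset.
toy = pd.DataFrame({'height_meters': ['1,004', '828', '632.5']})

# Remove the thousands separators, then convert the strings to floats.
toy['clean_height_meters'] = (
    toy['height_meters'].str.replace(',', '', regex=False).astype(float)
)

print(toy.dtypes)
print(toy)
```

Passing `regex=False` makes it explicit that the comma is a literal character rather than a pattern.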

In [None]:
ss.info()

### Data Cleaning
We can identify observations in the `floors` column that are not strictly whole numbers by using the string method `.isdigit()` or the Series methods `.str.isdigit()` and `.str.contains()`.

- String Method
  
  - `.isdigit()`: Checks if a string has only digits (`0-9`). Returns `True` for whole numbers but `False` for decimals, negatives, or mixed characters. Works only on single strings.  

- Series Methods 

  - `.str.isdigit()`: Applies `.isdigit()` to each value in a Series. Detects whole numbers but **not** decimals or negatives. Returns `NaN` for missing values.  

  - `.str.contains()`: Checks if values in a Series contain a substring or match a regex pattern. Supports advanced searches like finding non-numeric values (`r'\D'`).  
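To see how the three methods differ, here is a small sketch on a made-up Series. The values are hypothetical stand-ins for a messy `floors` column, not values from `ss`.

```python
import pandas as pd

# Hypothetical values standing in for a messy `floors` column.
floors = pd.Series(['101', '73 floors', '12.5', '-3', None])

# Plain Python string method: works on one string at a time.
print('101'.isdigit())   # True
print('12.5'.isdigit())  # False

# Series method: applies the check to every value in the Series.
digits_only = floors.str.isdigit()
print(digits_only)

# Regex search: True wherever at least one non-digit character appears.
# na=False reports missing values as False.
has_non_digit = floors.str.contains(r'\D', na=False)
print(has_non_digit)
```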

In [None]:
ss.loc[48, ]

In [None]:
ss.loc[61, ]

#### `.isdigit()`

In [None]:
ss['floors'][48].isdigit()

In [None]:
ss['floors'][61].isdigit()

In [None]:
ss['floors'].str.isdigit()

**Example 4.** Apply a Boolean mask to filter the `ss` DataFrame and keep only the observations where the `floors` column contains digit-only values.

In [None]:
mask = ss['floors'].str.isdigit()
ss[mask]

**Example 5.** Use `~` to reverse the Boolean mask, which returns the observations where the `floors` column contains non-digit values.

In [None]:
ss[~mask]
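The effect of a mask and its negation is easier to see on a few rows. The sketch below uses a made-up `toy` DataFrame with no missing values; the tower names and floor counts are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one value that needs cleaning.
toy = pd.DataFrame({
    'name': ['Tower A', 'Tower B', 'Tower C'],
    'floors': ['101', '73 floors', '88'],
})

mask = toy['floors'].str.isdigit()

digit_rows = toy[mask]     # rows whose `floors` value is all digits
problem_rows = toy[~mask]  # rows whose `floors` value needs cleaning

print(digit_rows)
print(problem_rows)
```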

#### `.str.contains()`

The code below filters the `ss` `DataFrame` to find rows where the `floors` column contains at least one non-digit character.

```python
ss[ss['floors'].str.contains(r'\D', na=False)]
```

**Step-by-Step Explanation:**

- `ss['floors']`: Selects the `floors` column from the `ss` `DataFrame`.

- `.str.contains(r'\D', na = False)`

    - `.str.contains()`: Checks if each value in the column contains a certain pattern.

    - `r'\D'`: The pattern to search for.

        - `\D` is a regex pattern that matches any non-digit character (letters, symbols, spaces, etc.). Regular expressions (regex) are used for searching, matching, and manipulating text based on patterns. They help identify specific text structures, such as email addresses, phone numbers, and numerical values. Regex is commonly used for data cleaning, validation, and extraction, making it a valuable tool for working with text data. Click **[here](https://media.datacamp.com/legacy/image/upload/v1665049611/Marketing/Blog/Regular_Expressions_Cheat_Sheet.pdf)** to view a cheat sheet of basic regular expressions.

        - The `r''` (raw string) prefix ensures that Python passes the backslash in `\D` to the regex engine unchanged instead of treating it as the start of an escape sequence.

    - `na = False`: Ensures that `NaN` (missing values) are treated as `False`, so they are **not** included in the filtered results.

- `ss[...]`: Filters the `DataFrame` to return only the rows where the condition is `True`, meaning `floors` contains at least one non-digit character.
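The pattern itself can be tested outside of `pandas` with Python's built-in `re` module. This is only an illustrative sketch: the sample strings are made up, and `re.search` stands in for what `.str.contains()` does to each value.

```python
import re

# Hypothetical sample strings.
samples = ['101', '73 floors', '12.5', '']

# re.search returns a match object when the pattern is found, otherwise None.
flags = [re.search(r'\D', s) is not None for s in samples]
print(flags)  # [False, True, True, False]

# The raw string r'\D' is the same text as the escaped string '\\D'.
print(r'\D' == '\\D')  # True
```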

In [None]:
ss[ss['floors'].str.contains(r'\D', na = False)]

In [None]:
ss.loc[48, 'floors'] = 103
ss.loc[61, 'floors'] = 73

In [None]:
print(ss.loc[48, 'floors'])
print(ss.loc[61, 'floors'])

In [None]:
ss['floors'] = pd.to_numeric(ss['floors'])
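The fix-then-convert pattern can be sketched on toy data. The values below are hypothetical, and the `errors='coerce'` step is an optional extra that is not part of the cells above: it turns values that cannot be parsed into `NaN`, which is one way to locate problem rows before fixing them.

```python
import pandas as pd

# Hypothetical toy data with one value that cannot be parsed as a number.
toy = pd.DataFrame({'floors': ['101', '73 floors', '88']})

# Optional check: unparseable values become NaN instead of raising an error.
coerced = pd.to_numeric(toy['floors'], errors='coerce')
print(coerced)

# Overwrite the problem value by row label and column name.
toy.loc[1, 'floors'] = '73'

# Convert the whole column now that every value is a valid number.
toy['floors'] = pd.to_numeric(toy['floors'])
print(toy.dtypes)
```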

In [None]:
ss.info()

In [None]:
ss['floors'].describe()

In [None]:
ss['country'].value_counts()

This code

```python
ss['country'].value_counts()
```

executed without any errors, but the counts are misleading because the same country has been entered in different ways, so its skyscrapers are split across several categories. For instance:

- USA, US, and United States of America are listed separately, though they refer to the same country.

- United Arab Emirates also appears as United Arab Emirates (UAE) and UAE.

- Malaysia is misspelled as Malasya.

- Saudi Arabia is listed as saudi Arabia (with a lowercase "s").
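One possible way to repair inconsistencies like these is to map every variant to a single spelling with `Series.replace()`. This is a sketch under assumptions, not necessarily the approach used in this lecture: the `countries` Series and the `fixes` dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of inconsistent country entries.
countries = pd.Series(['USA', 'US', 'United States of America',
                       'UAE', 'Malasya', 'saudi Arabia'])

# Hypothetical mapping from each observed variant to one chosen spelling.
fixes = {
    'USA': 'United States of America',
    'US': 'United States of America',
    'UAE': 'United Arab Emirates',
    'Malasya': 'Malaysia',
    'saudi Arabia': 'Saudi Arabia',
}

# Values that are not keys in the mapping are left unchanged.
cleaned = countries.replace(fixes)
print(cleaned.value_counts())
```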

### Filter

Previously in the course we used a Boolean mask to filter an R dataframe. We can do the same thing with a `pandas` `DataFrame`.

**Example 6.** Create a boolean mask that checks if the value in the `country` column is equal to "United States of America". This returns a `Series` of `True`/`False` values, where `True` indicates a match with the specified country.

In [None]:
ss['country'] == "United States of America"

**Example 7.** Assign to `mask` a boolean mask that checks if the values in the `country` column are equal to "United States of America". This mask will be a `Series` of `True`/`False` values, where `True` indicates rows that match the specified country.

In [None]:
mask = ...

**Example 8.** Filter the `ss` `DataFrame` using the boolean mask. This will return a new `DataFrame` containing only the rows where the mask is `True`, filtering the `DataFrame` based on the condition defined in the `mask`.

In [None]:
...

Keep in mind that our dataset contains some data entry inconsistencies, where USA, US, and United States of America are listed separately, even though they all refer to the same country. To ensure we capture all skyscrapers in the USA, we need to use the logical **`OR`** operator (`|`) to filter for all variations of the country name.

**Note:** If we need to meet multiple conditions, we use the logical **`AND`** operator (`&`). For example, to filter and select skyscrapers located in Chicago, United States of America, we would use the following syntax:

```python
(ss['country'] == "United States of America") & (ss['city'] == "Chicago")
```
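As a quick illustration of the `&` operator, the sketch below builds a made-up `toy` DataFrame and keeps only the rows that satisfy both conditions. The tower names and locations are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'name': ['Tower A', 'Tower B', 'Tower C'],
    'country': ['United States of America', 'United States of America', 'China'],
    'city': ['Chicago', 'New York City', 'Shanghai'],
})

# Each comparison is wrapped in parentheses before combining with &.
both = (toy['country'] == 'United States of America') & (toy['city'] == 'Chicago')

print(toy[both])
```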

**Example 9.** Create a boolean mask that checks if the value in the `country` column is either "United States of America" or "USA". The `|` operator is used to apply the logical `OR`, meaning the mask will return `True` if either condition is met.

In [None]:
(ss['country'] == "United States of America") | (ss['country'] == "USA")

In [None]:
mask = (
    (ss['country'] == "United States of America") | 
    (ss['country'] == "USA") |
    (ss['country'] == "United Sates of America")
)

ss[mask]

When a logical expression spans multiple lines, an outer pair of parentheses is used to group the whole expression into one block, like this:

```python
mask = (  # outer parenthesis starts the block
    (ss['country'] == "United States of America") |
    (ss['country'] == "USA") |
    (ss['country'] == "United Sates of America")
)  # outer parenthesis ends the block
```

The outer parentheses define the boundaries of the block. This allows Python to interpret the entire multi-line expression as a single logical unit, even though it is split across multiple lines. Without them, a line that ends with `|` is an incomplete expression, so Python raises a `SyntaxError`. The inner parentheses around each comparison are also required, because `|` binds more tightly than `==`.

**Example 10.** Use the `.isin()` method to filter `ss` for all variations of the country name, checking if the value in the `country` column is "United States of America", "USA", or "United Sates of America".

In [None]:
usa = ["United States of America", "USA", "United Sates of America"]

mask = ...

ss[mask]
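To show how `.isin()` relates to a chain of `|` conditions, here is a sketch on a different, made-up column. The `toy` DataFrame and the `wanted` list are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'name': ['Tower A', 'Tower B', 'Tower C', 'Tower D'],
    'city': ['Chicago', 'New York City', 'Shanghai', 'Chicago'],
})

wanted = ['Chicago', 'New York City']

# True wherever the city is one of the values in the list.
mask_isin = toy['city'].isin(wanted)

# The same condition written as a chain of OR comparisons.
mask_or = (toy['city'] == 'Chicago') | (toy['city'] == 'New York City')

print(mask_isin.equals(mask_or))
print(toy[mask_isin])
```

The list version stays the same length no matter how many values are added, which is why it tends to scale better than a long chain of `|`.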

You can also filter a `DataFrame` using the `.query()` method. The `.query()` method lets you write the condition as a string, where you refer to columns by name and can use the words `and` and `or` instead of the `&` and `|` operators. This is especially useful when filtering based on multiple conditions.
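As a small illustration on made-up data, the sketch below runs the same filter once with `.query()` and once with a boolean mask and checks that the results match. The `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'name': ['Tower A', 'Tower B', 'Tower C'],
    'country': ['United States of America', 'United States of America', 'China'],
    'city': ['Chicago', 'New York City', 'Shanghai'],
})

# Condition written as a string, using column names and the word `or`.
q = 'country == "China" or city == "Chicago"'
via_query = toy.query(q)

# The same condition written as a boolean mask.
via_mask = toy[(toy['country'] == 'China') | (toy['city'] == 'Chicago')]

print(via_query.equals(via_mask))
print(via_query)
```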

**Example 11.** Write a query string to filter the `ss` `DataFrame` for rows where the `country` column matches "United States of America", "USA", or the misspelled "United Sates of America". Use the `.query()` method to apply this filter and return only the matching rows.

In [None]:
q = 'country == "United States of America" ...'

ss.query(q)

**Example 12.** Use string concatenation to build the query across multiple lines for readability. Then, apply the `.query()` method to return the filtered `DataFrame`.

In [None]:
q = 'country == "United States of America" or ' + \
    'country == "USA" or ' + \
    'country == "United Sates of America"'

q

In [None]:
ss.query(q)
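As an alternative to `+ \`, Python also joins adjacent string literals that sit inside one pair of parentheses. The sketch below builds a shortened, hypothetical query string both ways and confirms the results are identical; the names `q_backslash` and `q_parens` are made up for illustration.

```python
# Built with explicit concatenation and backslash line continuation.
q_backslash = 'country == "United States of America" or ' + \
    'country == "USA"'

# Built with parentheses: adjacent string literals are joined automatically.
q_parens = (
    'country == "United States of America" or '
    'country == "USA"'
)

print(q_parens)
print(q_backslash == q_parens)  # True
```

Either string can be passed to `.query()`. In both versions, each piece except the last needs its trailing space so the words do not run together.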