<a href="https://colab.research.google.com/github/krauseannelize/nb-py-ms-exercises/blob/sprint04/notebooks/s04_pandas_data_wrangling/43_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 43 | Data Cleaning

Data cleaning is the process of preparing raw data for analysis by correcting errors, inconsistencies, and inaccuracies. A clean dataset helps prevent misleading results, enhances the performance of machine learning models, and improves the overall efficiency of data processing pipelines.

## Addressing Missing Values in DataFrames

Handling missing data is essential for maintaining the integrity of your analysis. Two common strategies include:

- **Deletion**: Removing rows or columns with missing values.
- **Imputation**: Replacing missing values with estimated values to preserve the dataset's structure.

### Deletion

The `.dropna()` method removes rows or columns containing null values. Use the `axis` parameter to specify the direction:

- `axis=0`: Delete **rows** with at least one null value  
- `axis=1`: Delete **columns** with at least one null value


In [None]:
import pandas as pd
import numpy as np

# Create DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2.0, 3.0, 4.0],
    'C': [1.0, np.nan, np.nan, 4.0],
    'D': [1, 2, 3, 4]
})

print("Original DataFrame:\n")
df

Original DataFrame:



     A    B    C  D
0  1.0  NaN  1.0  1
1  2.0  2.0  NaN  2
2  NaN  3.0  NaN  3
3  4.0  4.0  4.0  4


In [None]:
# Dropping rows with missing values
drop_rows = df.dropna(axis=0)
print("\nDataFrame after dropping rows with missing values:\n")
drop_rows


DataFrame after dropping rows with missing values:



     A    B    C  D
3  4.0  4.0  4.0  4


In [None]:
# Dropping columns with missing values
drop_cols = df.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:\n")
drop_cols


DataFrame after dropping columns with missing values:



   D
0  1
1  2
2  3
3  4
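
Beyond `axis`, `.dropna()` accepts a few parameters that make deletion less aggressive. The following is a minimal sketch on a small made-up DataFrame (`df_demo` and the result names are illustrative assumptions) showing `how`, `thresh`, and `subset`:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption for this sketch)
df_demo = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [np.nan, 2.0, 3.0, 4.0],
    "C": [1.0, np.nan, np.nan, 4.0],
})

# how="all": drop a row only if every value in it is null
drop_if_all_null = df_demo.dropna(how="all")

# thresh=2: keep only rows that have at least 2 non-null values
keep_if_two_values = df_demo.dropna(thresh=2)

# subset=["A"]: only column A is checked when deciding which rows to drop
drop_if_a_null = df_demo.dropna(subset=["A"])

print(len(drop_if_all_null), len(keep_if_two_values), len(drop_if_a_null))
```

These options are worth considering when the default behavior (drop on any null) would remove most of the data.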


### Imputation

Imputation involves replacing missing values with estimated ones to preserve the completeness of the dataset. The `.fillna()` method is used to replace missing values (NaN) in a DataFrame or Series with a specified value.

Common techniques include:

- **Fixed Fill**:  
  Replace missing values with a fixed value
  ```python
  # Basic syntax
  df['Column_name'].fillna('Replace_with')
  ```
- **Mean Imputation**:  
  Replace missing values in numeric columns with the mean of the non-null values.
  ```python
  # Basic syntax
  df['Column_name'].fillna(df['Column_name'].mean())
  ```

- **Mode Imputation**:  
  Replace missing values in categorical columns with the most frequent value (mode).
  ```python
  # Basic syntax
  # [0] selects first value from Series of most frequent values
  df['Column_name'].fillna(df['Column_name'].mode()[0])
  ```


In [None]:
import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 5.0],
    'B': [5, 6, 7, 8, 9]
})

print("Original DataFrame:\n")
df

Original DataFrame:



     A  B
0  1.0  5
1  2.0  6
2  NaN  7
3  4.0  8
4  5.0  9


In [None]:
# Use Mean Imputation to replace missing values
df['A'] = df['A'].fillna(df['A'].mean()) # Assign the result back
print("\nDataFrame after Mean Imputation:\n")
df


DataFrame after Mean Imputation:



     A  B
0  1.0  5
1  2.0  6
2  3.0  7
3  4.0  8
4  5.0  9


In [None]:
import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', np.nan, 'B', 'A', np.nan],
    'Value': [10, 20, 10, 30, 20, 10, 40]
})

print("Original DataFrame:\n")
df

Original DataFrame:



  Category  Value
0        A     10
1        B     20
2        A     10
3      NaN     30
4        B     20
5        A     10
6      NaN     40


In [None]:
# Use Mode Imputation to replace missing values
df['Category'] = df['Category'].fillna(df['Category'].mode()[0])
print("\nDataFrame after Mode Imputation:\n")
df


DataFrame after Mode Imputation:



  Category  Value
0        A     10
1        B     20
2        A     10
3        A     30
4        B     20
5        A     10
6        A     40
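
When several columns need different fill values, `.fillna()` also accepts a dictionary that maps column names to values. The sketch below is one possible way to combine strategies in a single call. It uses made-up data, and it swaps in the median (a technique not covered above) for the numeric column because the median is less sensitive to outliers than the mean:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption for this sketch); 100.0 acts as an outlier
df_demo = pd.DataFrame({
    "Score": [1.0, 2.0, np.nan, 4.0, 100.0],
    "Category": ["A", "B", "A", np.nan, "A"],
})

# One fill value per column: median for numeric, mode for categorical
fill_values = {
    "Score": df_demo["Score"].median(),
    "Category": df_demo["Category"].mode()[0],
}

# fillna returns a new DataFrame; df_demo itself is unchanged
filled = df_demo.fillna(fill_values)
print(filled)
```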


### Choosing Between Deletion and Imputation

- **Extent of Missingness**: Deletion works when only a small share of the data is missing; when a significant portion is missing, imputation avoids discarding most of the dataset.
- **Importance of Variables**: If a variable with missing values is critical, imputation is preferable to avoid losing information.
- **Nature of Data**: Consider whether the data is categorical or continuous, as this affects the choice of imputation technique.
- **Analysis Requirements**: For machine learning, imputing values is often necessary, as algorithms usually cannot handle missing data directly.
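
To judge the extent of missingness before choosing a strategy, it helps to measure it. The following sketch, on illustrative data, computes the fraction of missing values per column and the number of rows that row deletion would remove:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption for this sketch)
df_demo = pd.DataFrame({
    "Age": [25, np.nan, 35, 45, np.nan, 50],
    "Salary": [50000, 60000, np.nan, 80000, 90000, np.nan],
    "City": ["New York", "Los Angeles", "New York", np.nan, "San Francisco", "Chicago"],
})

# isna() gives a boolean DataFrame; the mean of booleans is the missing fraction
missing_share = df_demo.isna().mean()

# Rows containing at least one null value, i.e. what dropna(axis=0) would remove
rows_affected = df_demo.isna().any(axis=1).sum()

print(missing_share)
print(rows_affected, "of", len(df_demo), "rows contain a missing value")
```

In a case like this, where each column is missing only a few values but most rows are affected, row deletion is costly and imputation is usually the better fit.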

## Handling Duplicates

Duplicates can distort analysis and lead to misleading insights. Identifying and resolving them is essential for maintaining a clean and reliable dataset.

The `drop_duplicates()` method removes duplicate rows based on all columns by default. By using the `subset` parameter, you can specify which columns to consider when identifying duplicates.

In [None]:
import pandas as pd

# Creating DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25, 25, 30, 25],
    'City': ['New York', 'Los Angeles', 'New York', 'New York', 'Los Angeles', 'New York'],
    'Country': ['USA', 'USA', 'USA', 'Canada', 'USA', 'USA']
})

# Display DataFrame before changes
print("Original DataFrame:\n")
df

Original DataFrame:



    Name  Age         City  Country
0  Alice   25     New York      USA
1    Bob   30  Los Angeles      USA
2  Alice   25     New York      USA
3  Alice   25     New York   Canada
4    Bob   30  Los Angeles      USA
5  Alice   25     New York      USA


In [None]:
# Drop all duplicate rows from dataset without a subset
clean_df = df.drop_duplicates()
print("\nDataFrame after dropping duplicates:\n")
clean_df


DataFrame after dropping duplicates:



    Name  Age         City  Country
0  Alice   25     New York      USA
1    Bob   30  Los Angeles      USA
3  Alice   25     New York   Canada


In [None]:
# Drop duplicate rows based on Name, Age and City
clean_df = df.drop_duplicates(subset=['Name', 'Age', 'City'])
print("\nDataFrame after dropping duplicates based on Name, Age and City:\n")
clean_df


DataFrame after dropping duplicates based on Name, Age and City:



    Name  Age         City  Country
0  Alice   25     New York      USA
1    Bob   30  Los Angeles      USA
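
Before dropping anything, it can be useful to inspect which rows count as duplicates and to choose which occurrence survives. The sketch below, on made-up data, uses `duplicated()` to flag repeats and the `keep` parameter of `drop_duplicates()` (its default, `'first'`, is what the examples above rely on):

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "City": ["New York", "Los Angeles", "New York", "Los Angeles"],
    "Salary": [70000, 80000, 71000, 80000],
})

# True for every row that repeats an earlier Name/City combination
dup_mask = df_demo.duplicated(subset=["Name", "City"])

# keep="last": keep the last occurrence of each combination instead of the first
keep_last = df_demo.drop_duplicates(subset=["Name", "City"], keep="last")

# keep=False: drop every row that has a duplicate, keeping none of them
drop_all = df_demo.drop_duplicates(subset=["Name", "City"], keep=False)

print(dup_mask.tolist())
print(keep_last)
```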


## Data Type Conversion with Python

The `astype()` method converts a column to the data type passed as its argument. Converting columns to the correct type keeps the data consistent and enables the appropriate operations on it, such as arithmetic on numeric columns.

In [2]:
import pandas as pd

# Creating a DataFrame with numbers stored as strings
df = pd.DataFrame({
    'A': ['1', '4'],
    'B': ['2', '5'],
    'C': ['3', '6'],
    'D': ['7', '8']
})

# Confirm datatype is string
df.dtypes

A    object
B    object
C    object
D    object
dtype: object


In [3]:
# Convert all columns to integers
df = df.astype(int)
df.dtypes

A    int64
B    int64
C    int64
D    int64
dtype: object
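
`astype()` fails if any value cannot be converted. For columns that may contain stray text, `pd.to_numeric()` with `errors="coerce"` is a common alternative: it converts what it can and turns the rest into `NaN`, which can then be handled with the missing-value techniques above. A minimal sketch, assuming a made-up column with one invalid entry:

```python
import pandas as pd

# Illustrative column (an assumption): one entry is not a valid number
raw = pd.Series(["1", "2", "n/a", "4"])

# astype(int) raises ValueError because "n/a" cannot be parsed
try:
    raw.astype(int)
    astype_failed = False
except ValueError:
    astype_failed = True

# errors="coerce" converts valid entries and replaces invalid ones with NaN
numbers = pd.to_numeric(raw, errors="coerce")

print(astype_failed)
print(numbers.tolist())
```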


## Cleaning Strings

Pandas provides string methods for cleaning and standardizing text data, such as:

| Issue | Fix | Example |
| --- | --- | --- |
| Correcting capitalization | `.str.title()` | `"john doe"` → `"John Doe"` |
| Stripping whitespaces | `.str.strip()` | `"  hello "` → `"hello"` |
| Column value splitting | `.str.split()` | `"John,Doe"` → `["John", "Doe"]` |
| Concatenating strings | `+` or `.str.cat()` | `"John" + " " + "Doe"` → `"John Doe"` |
| Slicing strings | `.str[:]` | `.str[:3]`: `"abcdef"` → `"abc"` |
| Replacing characters | `.str.replace()` | `.str.replace(" ", "_")`: `"hello world"` → `"hello_world"` |

_Note: All string methods must be accessed via `.str` when applied to a Pandas Series._
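
The methods in the table can be chained on a Series. The sketch below applies several of them to a small made-up Series (the data and variable names are illustrative assumptions):

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
names = pd.Series(["  john doe ", "JANE SMITH", "max_power"])

# Strip whitespace, replace underscores with spaces, then fix capitalization
cleaned = (
    names.str.strip()
         .str.replace("_", " ", regex=False)
         .str.title()
)

# Slice the first three characters of every value
first_three = cleaned.str[:3]

# Split into two columns, then join them back together with .str.cat()
parts = cleaned.str.split(" ", expand=True)
rejoined = parts[0].str.cat(parts[1], sep=" ")

print(cleaned.tolist())
print(first_three.tolist())
```

Passing `regex=False` to `.str.replace()` makes the pattern a literal string, which avoids surprises with characters such as `$` or `.` that have a special meaning in regular expressions.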

## Exercise 1

This challenge lets you practice handling missing values. Using the criteria above for choosing between deletion and imputation, eliminate the null values while losing as few records as possible. As always, replace the “___”.

```python
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Age': [25, np.nan, 35, 45, np.nan, 50],
    'Salary': [50000, 60000, np.nan, 80000, 90000, np.nan],
    'City': ['New York', 'Los Angeles', 'New York', np.nan,
'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)

# Display the dataset
print("Original Data:\n", df)

# Task 1: Fill missing numeric values with the mean of the column
df['Age'].fillna(___, inplace=True)
df['Salary'].fillna(___, inplace=True)

# Task 2: Fill missing categorical values with the mode
df['City'].fillna(___[0], inplace=True)

print("\nData after filling missing values:\n", df)
```

In [None]:
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Age': [25, np.nan, 35, 45, np.nan, 50],
    'Salary': [50000, 60000, np.nan, 80000, 90000, np.nan],
    'City': ['New York', 'Los Angeles', 'New York', np.nan,
'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)

# Display the dataset
print("Original Data:\n", df)

# Task 1: Fill missing numeric values with the mean of the column
# Note: df['Age'].fillna(..., inplace=True) is chained assignment. Recent pandas versions
# warn about it, and it stops updating df under Copy-on-Write, so assign the result back instead.

df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Task 2: Fill missing categorical values with the mode
df['City'] = df['City'].fillna(df['City'].mode()[0])

print("\nData after filling missing values:\n", df)

Original Data:
     Age   Salary           City
0  25.0  50000.0       New York
1   NaN  60000.0    Los Angeles
2  35.0      NaN       New York
3  45.0  80000.0            NaN
4   NaN  90000.0  San Francisco
5  50.0      NaN        Chicago

Data after filling missing values:
      Age   Salary           City
0  25.00  50000.0       New York
1  38.75  60000.0    Los Angeles
2  35.00  70000.0       New York
3  45.00  80000.0       New York
4  38.75  90000.0  San Francisco
5  50.00  70000.0        Chicago


## Exercise 2

You have a dataset with missing numeric values in the `Age` and `Salary` columns. Apply mean imputation to replace these missing values. Make sure to replace the missing `Age` values with the mean of the `Age` column and the missing `Salary` values with the mean of the `Salary` column.

```python
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Age': [25, np.nan, 35, 45, np.nan, 50],
    'Salary': [50000, 60000, np.nan, 80000, 90000, np.nan]
}

df = pd.DataFrame(data)

# Fill missing numeric values with the mean of the column
df['___'].fillna(___, inplace=True)  # Replace missing Age values with ___
df['___'].fillna(___, inplace=True)  # Replace missing Salary values with ___

print(df)
```

In [None]:
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'Age': [25, np.nan, 35, 45, np.nan, 50],
    'Salary': [50000, 60000, np.nan, 80000, 90000, np.nan]
}

df = pd.DataFrame(data)

# Fill missing numeric values with the mean of the column
# Note: df['Age'].fillna(..., inplace=True) is chained assignment. Recent pandas versions
# warn about it, and it stops updating df under Copy-on-Write, so assign the result back instead.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print(df)

     Age   Salary
0  25.00  50000.0
1  38.75  60000.0
2  35.00  70000.0
3  45.00  80000.0
4  38.75  90000.0
5  50.00  70000.0


## Exercise 3

You have a dataset with missing categorical values in the `City` column. Apply mode imputation to replace these missing values with the most frequent value (mode) in the `City` column.

```python
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'City': ['New York', 'Los Angeles', 'New York', np.nan, 'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)

# Fill missing categorical values with the mode of the column
df['___'].fillna(___, inplace=___)  # Replace missing City values with ___

print(df)
```

In [None]:
import pandas as pd
import numpy as np

# Sample dataset
data = {
    'City': ['New York', 'Los Angeles', 'New York', np.nan, 'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)

# Fill missing categorical values with the mode of the column
# Note: df['City'].fillna(..., inplace=True) is chained assignment. Recent pandas versions
# warn about it, and it stops updating df under Copy-on-Write, so assign the result back instead.
df['City'] = df['City'].fillna(df['City'].mode()[0])

print(df)

            City
0       New York
1    Los Angeles
2       New York
3       New York
4  San Francisco
5        Chicago


## Exercise 4

You have a dataset with some rows containing missing values. Use row deletion to remove all rows that have at least one missing value. Apply the `dropna()` method to drop rows with null values.

```python
import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [1, np.nan, 3, 4]
}

df = pd.DataFrame(data)

# Drop rows with missing values
df = df.dropna(___)  # Remove rows with ___ missing values

print(df)
```

In [None]:
import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [1, np.nan, 3, 4]
}

df = pd.DataFrame(data)

# Drop rows with missing values
df = df.dropna(axis=0)

print(df)

     A    B    C
3  4.0  4.0  4.0


## Exercise 5

You have a dataset with some columns containing missing values. Use column deletion to remove all columns that have at least one missing value. Apply the `dropna()` method with the `axis=1` parameter to drop columns with null values.

```python
import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [1, np.nan, 3, 4],
    'D': [1, 2, 3, 4]
}

df = pd.DataFrame(data)

# Drop columns with missing values
___ = df.___(axis=___)  # Remove columns with ___ missing values

print(df)
```

In [None]:
import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [1, np.nan, 3, 4],
    'D': [1, 2, 3, 4]
}

df = pd.DataFrame(data)

# Drop columns with missing values
df = df.dropna(axis=1)

print(df)

   D
0  1
1  2
2  3
3  4


## Exercise 6

Drop the duplicate rows considering the “Name” and “City” columns. Replace the “___”.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Charlie', 'David'],
    'Age': [30, 31, 32, 33, 25, 40, 35],
    'City': ['New York', 'New York', 'New York', 'New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Salary': [70000, 72000, 71000, 73000, 80000, 120000, 90000]
}

df = pd.DataFrame(data)

# Drop duplicates based on 'Name' and 'City'

df_dropped_duplicates = df.drop_duplicates(___)  # Replace the "___"

print("\nDataFrame after dropping duplicates based on 'Name' and 'City':")
print(df_dropped_duplicates)
```

In [None]:
import pandas as pd

data = {
    'Name': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Charlie', 'David'],
    'Age': [30, 31, 32, 33, 25, 40, 35],
    'City': ['New York', 'New York', 'New York', 'New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Salary': [70000, 72000, 71000, 73000, 80000, 120000, 90000]
}

df = pd.DataFrame(data)

# Drop duplicates based on 'Name' and 'City'

df_dropped_duplicates = df.drop_duplicates(subset=['Name', 'City'])

print("\nDataFrame after dropping duplicates based on 'Name' and 'City':")
print(df_dropped_duplicates)


DataFrame after dropping duplicates based on 'Name' and 'City':
      Name  Age         City  Salary
0    Alice   30     New York   70000
4      Bob   25  Los Angeles   80000
5  Charlie   40      Chicago  120000
6    David   35      Houston   90000


## Exercise 7

You have a dataset with some duplicate rows. Use the `drop_duplicates()` method to remove all duplicate rows, keeping only unique records in the DataFrame.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
}

df = pd.DataFrame(data)

# Remove all duplicate rows
df = df.___()  # Replace "___" with the correct method

print(df)
```

In [1]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
}

df = pd.DataFrame(data)

# Remove all duplicate rows
df = df.drop_duplicates()

print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
3  Charlie   35      Chicago


## Exercise 8

You want to remove duplicates based on the `Name` and `City` columns. Use the `drop_duplicates()` method with the `subset` parameter to drop rows that have the same `Name` and `City`.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
    'Salary': [70000, 80000, 71000, 120000, 80000]
}

df = pd.DataFrame(data)

# Drop duplicates based on 'Name' and 'City'
df = df.drop_duplicates(subset=['___', '___'])  # Replace "___" with the correct parameter

print(df)
```

In [None]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
    'Salary': [70000, 80000, 71000, 120000, 80000]
}

df = pd.DataFrame(data)

# Drop duplicates based on 'Name' and 'City'
df = df.drop_duplicates(subset=['Name', 'City'])

print(df)

## Exercise 9

Let’s say we want to calculate the average weight, height and speed of the members of the following DataFrame. Replace the “___” in order to be able to do that.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Weight': ['55.5', '70.2', '66.3', '80.4'],  
    'Height': [1.65, 1.75, 1.80, 1.70],  
    'Speed': [10.5, 12.0, 11.3, 9.8]
}

df = pd.DataFrame(data)

df['___'] = df['___'].___(float) # replace the "___"
print(df['Speed'].mean())
print(df['___'].mean()) # replace the "___"
print(df['___'].mean()) # replace the "___"
```

In [4]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Weight': ['55.5', '70.2', '66.3', '80.4'],
    'Height': [1.65, 1.75, 1.80, 1.70],
    'Speed': [10.5, 12.0, 11.3, 9.8]
}

df = pd.DataFrame(data)

df[['Weight', 'Height', 'Speed']] = df[['Weight', 'Height', 'Speed']].astype(float)
print(df['Speed'].mean())
print(df['Weight'].mean())
print(df['Height'].mean())

10.899999999999999
68.1
1.725


## Exercise 10

You have a dataset where a numeric column is stored as a string. Use the `astype()` method to convert this column to the appropriate numeric type (`float`), then calculate the mean of that column.

```python
import pandas as pd

# Sample dataset with numeric values as strings
data = {
    'Values': ['10', '20', '30', '40']
}

df = pd.DataFrame(data)

# Convert the 'Values' column to float type and calculate the mean
df['Values'] = df['___'].___(___)  # Replace "___" with the correct method

average = df['Values'].___()
print(average)
```

In [6]:
import pandas as pd

# Sample dataset with numeric values as strings
data = {
    'Values': ['10', '20', '30', '40']
}

df = pd.DataFrame(data)

# Convert the 'Values' column to float type and calculate the mean
df['Values'] = df['Values'].astype(float)

average = df['Values'].mean()
print(average)

25.0


## Exercise 11

You have a dataset where the `Weight` column is stored as strings. Convert this column to `float` using the `astype()` method, and then calculate the mean of the `Weight`, `Height`, and `Speed` columns.

```python
import pandas as pd

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Weight': ['55.5', '70.2', '66.3', '80.4'],
    'Height': [1.65, 1.75, 1.80, 1.70],
    'Speed': [10.5, 12.0, 11.3, 9.8]
}

df = pd.DataFrame(data)

# Convert 'Weight' column to float
df['Weight'] = df['___'].____  # Replace "___" with the correct method

# Calculate and print the mean of 'Speed', 'Weight', and 'Height'
print(df['___'].mean())
print(df['___'].mean())  # Replace the second "___" with 'Weight'
print(df['___'].mean())  # Replace the third "___" with 'Height'
```

In [7]:
import pandas as pd

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Weight': ['55.5', '70.2', '66.3', '80.4'],
    'Height': [1.65, 1.75, 1.80, 1.70],
    'Speed': [10.5, 12.0, 11.3, 9.8]
}

df = pd.DataFrame(data)

# Convert 'Weight' column to float
df['Weight'] = df['Weight'].astype(float)

# Calculate and print the mean of 'Speed', 'Weight', and 'Height'
print(df['Speed'].mean())
print(df['Weight'].mean())
print(df['Height'].mean())

10.899999999999999
68.1
1.725


## Exercise 12

You have a dataset with a column containing dates stored as strings. Convert this column to `datetime` format using `pd.to_datetime()`, and then calculate the difference between two dates.

```python
import pandas as pd

# Sample dataset with date column as strings
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'DateOfBirth': ['1990-01-01', '1985-05-15', '2000-07-20']
}

df = pd.DataFrame(data)

# Convert the 'DateOfBirth' column to datetime format
df['DateOfBirth'] = pd.___(df['DateOfBirth'])  # Replace "___" with the correct method

# Calculate and print the difference between two dates (e.g., the date today and the date of birth)
date_diff = pd.Timestamp.today() - df['DateOfBirth']
print(___)
```

In [8]:
import pandas as pd

# Sample dataset with date column as strings
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'DateOfBirth': ['1990-01-01', '1985-05-15', '2000-07-20']
}

df = pd.DataFrame(data)

# Convert the 'DateOfBirth' column to datetime format
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])

# Calculate and print the difference between two dates (e.g., the date today and the date of birth)
date_diff = pd.Timestamp.today() - df['DateOfBirth']
print(date_diff)

0   13043 days 12:23:08.636691
1   14735 days 12:23:08.636691
2    9190 days 12:23:08.636691
Name: DateOfBirth, dtype: timedelta64[ns]
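
The result of subtracting dates is a `timedelta64` Series, which is precise but hard to read. As a possible follow-up, the sketch below extracts whole days with `.dt.days` and derives an approximate age in years. It uses a fixed reference date instead of `pd.Timestamp.today()` (an assumption made so the result is reproducible), and dividing by 365.25 is only an approximation:

```python
import pandas as pd

# Same birth dates as in the exercise, as a standalone Series
dob = pd.to_datetime(pd.Series(["1990-01-01", "1985-05-15", "2000-07-20"]))

# Fixed reference date (an assumption, chosen for reproducibility)
reference = pd.Timestamp("2025-01-01")

# .dt.days extracts the whole-day component of each timedelta
age_days = (reference - dob).dt.days

# Approximate age in completed years (365.25 days per year on average)
age_years = (age_days / 365.25).astype(int)

print(age_days.tolist())
print(age_years.tolist())
```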


## Exercise 13

Now it's your turn to clean the data. Replace the “___” to get your data clean!

```python
import pandas as pd

data = {'Product': ['   Apple    ', 'banana', '  ORANGE  ', 'kiwifruit '
        , 'grapes'], 'Sales': ['100$', ' 200$ ', '  150$ ', '120$  ',
        '  90$ '], 'Place': ['North America - USA',
        'South America - Brazil', 'Europe - France', 'Asia - Japan',
        'Australia - Australia']}

sales_df = pd.DataFrame(data)

# Strip the whitespace from all the columns

sales_df['Product'] = sales_df['___'].str.___() # Replace the "___"
sales_df['Sales'] = sales_df['____'].str.___() # Replace the "___"
sales_df['Place'] = sales_df['___'].str.___() # Replace the "___"

# Correct product capitalization (after stripping, so the first character is a letter)

sales_df['Product'] = sales_df['___'].str.capitalize() # Replace the "___"

# Splitting 'Place' into new columns: 'Region' and 'Country'

sales_df[['___', '___']] = sales_df['Place'].str.split('___',
        expand=True) # Replace the "___"

# Replacing characters (e.g., removing "$" by replacing it with "")

sales_df['Sales'] = sales_df['Sales'].str.replace('___', '___', regex=False)

print('\nSales DataFrame after String Cleaning Operations:')
print(sales_df)
```

In [9]:
import pandas as pd

data = {'Product': ['   Apple    ', 'banana', '  ORANGE  ', 'kiwifruit '
        , 'grapes'], 'Sales': ['100$', ' 200$ ', '  150$ ', '120$  ',
        '  90$ '], 'Place': ['North America - USA',
        'South America - Brazil', 'Europe - France', 'Asia - Japan',
        'Australia - Australia']}

sales_df = pd.DataFrame(data)

# Strip the whitespace from all the columns

sales_df['Product'] = sales_df['Product'].str.strip()
sales_df['Sales'] = sales_df['Sales'].str.strip()
sales_df['Place'] = sales_df['Place'].str.strip()

# Correct product capitalization (after stripping, so the first character is a letter;
# capitalizing first would leave values with leading spaces in lowercase)

sales_df['Product'] = sales_df['Product'].str.capitalize()

# Splitting 'Place' into new columns: 'Region' and 'Country'

sales_df[['Region', 'Country']] = sales_df['Place'].str.split(' - ', expand=True)

# Replacing characters (removing "$" by replacing it with "")
# regex=False treats "$" as a literal character rather than a regex anchor

sales_df['Sales'] = sales_df['Sales'].str.replace('$', '', regex=False)

print('\nSales DataFrame after String Cleaning Operations:')
print(sales_df)


Sales DataFrame after String Cleaning Operations:
     Product Sales                   Place         Region    Country
0      Apple   100     North America - USA  North America        USA
1     Banana   200  South America - Brazil  South America     Brazil
2     Orange   150         Europe - France         Europe     France
3  Kiwifruit   120            Asia - Japan           Asia      Japan
4     Grapes    90   Australia - Australia      Australia  Australia


## Exercise 14

You have a dataset with names in the `Names` column, and you want to check if there are any duplicate names in the column. Then apply a function to ensure the capitalization is consistent (i.e., first letters of each name are capitalized).

```python
import pandas as pd

# Sample dataset
data = {
    'Names': ['john doe', 'Jane Smith', 'james BOND', 'john doe', 'Max Power'],
    'Comments': ['Great job!', 'good work', 'excellent job', 'Nice teamwork', 'poor performance'],
    'Salary': ['$50000', '60000', '$70000.50', '80000', '$90000']
}

df = pd.DataFrame(data)

# Check for unique values in the 'Names' column
print(df['____'].____())  # Use the "___" function to get unique names

# Correct capitalization in the 'Names' column
df['____'] = df['____'].str.____()  # Use the "___" method to correct capitalization

print(df['Names'].to_list())
```

In [10]:
import pandas as pd

# Sample dataset
data = {
    'Names': ['john doe', 'Jane Smith', 'james BOND', 'john doe', 'Max Power'],
    'Comments': ['Great job!', 'good work', 'excellent job', 'Nice teamwork', 'poor performance'],
    'Salary': ['$50000', '60000', '$70000.50', '80000', '$90000']
}

df = pd.DataFrame(data)

# Check for unique values in the 'Names' column
print(df['Names'].unique())

# Correct capitalization in the 'Names' column
df['Names'] = df['Names'].str.title()

print(df['Names'].to_list())

['john doe' 'Jane Smith' 'james BOND' 'Max Power']
['John Doe', 'Jane Smith', 'James Bond', 'John Doe', 'Max Power']


## Exercise 15

You have a dataset where the `Names` column contains leading or trailing whitespaces. Additionally, you want to check if any names contain a specific unwanted character (e.g., an underscore "_"). Clean the data.

```python
import pandas as pd

# Sample dataset
data = {
    'Names': ['  john doe', '  Jane_Smith', 'james BOND   ', '   Max Power   ', 'Sarah O\'Connor'],
    'Comments': ['Great job!', 'good work', 'excellent job', 'Nice teamwork', 'poor performance'],
    'Salary': ['$50000', '60000', '$70000.50', '80000', '$90000']
}

df = pd.DataFrame(data)

# Strip leading and trailing whitespaces from 'Names' column
df['Names'] = df['Names'].___.___()  # Use the "___" method to strip whitespaces

# Check if any names contain an underscore
contains_underscore = df['____'].str.____('_').any()  # Replace "___" with the correct method to check for underscores
print(f"Do any names contain an underscore? {contains_underscore}")

print(df['Names'].to_list())
```

In [14]:
import pandas as pd

# Sample dataset
data = {
    'Names': ['  john doe', '  Jane_Smith', 'james BOND   ', '   Max Power   ', 'Sarah O\'Connor'],
    'Comments': ['Great job!', 'good work', 'excellent job', 'Nice teamwork', 'poor performance'],
    'Salary': ['$50000', '60000', '$70000.50', '80000', '$90000']
}

df = pd.DataFrame(data)

# Strip leading and trailing whitespaces from 'Names' column
df['Names'] = df['Names'].str.strip()

# Check if any names contain an underscore
contains_underscore = df['Names'].str.contains('_').any()
print(f"Do any names contain an underscore? {contains_underscore}")

print(df['Names'].to_list())

Do any names contain an underscore? True
['john doe', 'Jane_Smith', 'james BOND', 'Max Power', "Sarah O'Connor"]


## Exercise 16

The `Names` column contains both first and last names in a single string. Split the `Names` column into `First_Name` and `Last_Name`. After splitting, drop the original `Names` column.

```python
import pandas as pd

# Sample dataset
data = {
    'Names': ['john doe', 'Jane Smith', 'james BOND', 'Max Power', 'Sarah O\'Connor'],
    'Comments': ['Great job!', 'good work', 'excellent job', 'Nice teamwork', 'poor performance'],
    'Salary': ['$50000', '60000', '$70000.50', '80000', '$90000']
}

df = pd.DataFrame(data)

# Split the 'Names' column into 'First_Name' and 'Last_Name'
df[['___', '___']] = df['Names'].____.____(' ', n=1, expand=True)  # Use the "___" function to split by space
df = df.___(columns=['Names'])

print(df)
```

In [12]:
import pandas as pd

# Sample dataset
data = {
    'Names': ['john doe', 'Jane Smith', 'james BOND', 'Max Power', 'Sarah O\'Connor'],
    'Comments': ['Great job!', 'good work', 'excellent job', 'Nice teamwork', 'poor performance'],
    'Salary': ['$50000', '60000', '$70000.50', '80000', '$90000']
}

df = pd.DataFrame(data)

# Split the 'Names' column into 'First_Name' and 'Last_Name'
# n=1 → split only once
# expand=True → result should be returned as a DataFrame
df[['First_Name', 'Last_Name']] = df['Names'].str.split(' ', n=1, expand=True)
df = df.drop(columns=['Names'])

print(df)

           Comments     Salary First_Name Last_Name
0        Great job!     $50000       john       doe
1         good work      60000       Jane     Smith
2     excellent job  $70000.50      james      BOND
3     Nice teamwork      80000        Max     Power
4  poor performance     $90000      Sarah  O'Connor


## Exercise 17

You have a dataset with names in the `Names` column, and you want to convert all names to lowercase.

```python
import pandas as pd

# Sample dataset
data = {
    'Names': ['John Doe', 'jane smith', 'JAMES bond', 'Max Power', 'sarah o\'connor'],
    'Comments': ['Great job!', 'good work', 'excellent job', 'Nice teamwork', 'poor performance'],
    'Salary': ['$50000', '60000', '$70000.50', '80000', '$90000']
}

df = pd.DataFrame(data)

# Convert all names to lowercase
df['Names'] = ___  # Replace "___" with the correct method

print(df['___'].to_list())
```

In [13]:
import pandas as pd

# Sample dataset
data = {
    'Names': ['John Doe', 'jane smith', 'JAMES bond', 'Max Power', 'sarah o\'connor'],
    'Comments': ['Great job!', 'good work', 'excellent job', 'Nice teamwork', 'poor performance'],
    'Salary': ['$50000', '60000', '$70000.50', '80000', '$90000']
}

df = pd.DataFrame(data)

# Convert all names to lowercase
df['Names'] = df['Names'].str.lower()

print(df['Names'].to_list())

['john doe', 'jane smith', 'james bond', 'max power', "sarah o'connor"]
