# 3.2 Data Cleaning
## 3.2.1 Getting Started with pandas

Read the input data into a DataFrame:

```python
import pandas as pd

# Read the input data
dataframe = pd.read_csv('./Sales Transaction v.4a.csv')
```

Display the first few rows, transposed so that each column becomes a row:

```python
# Display the DataFrame's head, transposed
dataframe.head().transpose()
```

Get DataFrame dimensions:

```python
# Get DataFrame dimensions as (rows, columns)
dataframe.shape
# >>> (536350, 8)

# Display the number of rows
dataframe.shape[0]
# >>> 536350
```

Get and modify column names:

```python
# Print DataFrame columns
dataframe.columns
# >>> Index(['TransactionNo', 'Date', 'ProductNo', 'ProductName', 'Price',
#            'Quantity', 'CustomerNo', 'Country'],
#           dtype='object')

# Clean column names: lowercase them and replace spaces with underscores
dataframe.columns = [col.lower().replace(' ', '_') for col in dataframe.columns]
```
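The column names printed above are CamelCase and contain no spaces, so replacing spaces alone yields names such as `transactionno`. If `snake_case` names are preferred, a minimal sketch could look like the following. The helper `to_snake_case` and the empty toy frame are assumptions made for illustration; they are not part of the original dataset code.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Convert a CamelCase or spaced column name to snake_case (hypothetical helper)."""
    # Insert an underscore wherever a lowercase letter or digit is
    # directly followed by an uppercase letter, e.g. ProductName -> Product_Name
    with_underscores = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace any remaining spaces and lowercase the result
    return with_underscores.replace(" ", "_").lower()


# Empty toy frame that reuses the column names shown above
toy = pd.DataFrame(columns=["TransactionNo", "Date", "ProductNo", "ProductName",
                            "Price", "Quantity", "CustomerNo", "Country"])
toy.columns = [to_snake_case(col) for col in toy.columns]

print(list(toy.columns))
# ['transaction_no', 'date', 'product_no', 'product_name',
#  'price', 'quantity', 'customer_no', 'country']
```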

## 3.2.2 Data Cleaning
### 3.2.2.1 Handling Duplicates

Check for duplicates in the DataFrame:

```python
import pandas as pd

# Read dataset
dataframe = pd.read_csv('./Sales Transaction v.4a.csv')

# Print the number of duplicated rows
print(dataframe.duplicated().sum())
# >>> 5200
```

Remove duplicates, keeping the last occurrence:

```python
# Drop duplicated rows and keep the last occurrence of each
dataframe = dataframe.drop_duplicates(keep='last')
```
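To see what the `keep` argument changes, the sketch below runs `duplicated()` and `drop_duplicates()` on a small toy frame. The data is made up for illustration and is not taken from the sales dataset.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "TransactionNo": ["A1", "A2", "A1", "A3"],
    "Quantity": [2, 5, 2, 1],
})

# By default, duplicated() flags every repeat after the first occurrence
n_duplicates = toy.duplicated().sum()

# keep='first' retains row 0, keep='last' retains row 2
keep_first = toy.drop_duplicates(keep="first")
keep_last = toy.drop_duplicates(keep="last")

print(n_duplicates)            # 1
print(list(keep_first.index))  # [0, 1, 3]
print(list(keep_last.index))   # [1, 2, 3]
```

Both variants leave the same set of distinct rows; they differ only in which original index survives, which matters when row order or index carries meaning.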

### 3.2.2.2 Missing Values
#### 3.2.2.2.1 What is a Missing Value?
#### 3.2.2.2.2 Identify Missing Values

Print DataFrame information and calculate missing value percentages:

```python
import pandas as pd

dataframe = pd.read_csv('./Sales Transaction v.4a.csv')

# Print column dtypes and non-null counts
dataframe.info()

# Define function to calculate NaN percentage per column
def calculate_nan_percentage_per_column(df):
    # Divide the NaN count of each column by the number of rows
    # (df.size would count every cell in the DataFrame, not one column)
    total_rows = len(df)
    nan_count = df.isnull().sum()
    return nan_count / total_rows * 100

print(calculate_nan_percentage_per_column(dataframe))
```
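The percentage of missing values in a column is its NaN count divided by the number of rows. Since the mean of a boolean mask is exactly that fraction, the same figure can be obtained in one line. The sketch below uses a made-up toy frame to keep the numbers easy to check.

```python
import numpy as np
import pandas as pd

# Toy data with known gaps: 2 of 4 prices and 1 of 4 countries are missing
toy = pd.DataFrame({
    "Price": [10.0, np.nan, 12.5, np.nan],
    "Country": ["France", "Spain", None, "Italy"],
    "Quantity": [1, 2, 3, 4],
})

# isna() gives a True/False mask; its column-wise mean is the NaN fraction
nan_percentage = toy.isna().mean() * 100

print(nan_percentage)
# Price 50.0, Country 25.0, Quantity 0.0
```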

Visualize missing values with missingno:

```python
import pandas as pd
import missingno as msno

# Read a toy dataset from the missingno GitHub repository
collisions = pd.read_csv("https://raw.githubusercontent.com/ResidentMario/missingno-data/master/nyc_collision_factors.csv")

# Display 250 random rows
msno.matrix(collisions.sample(250))
```

#### 3.2.2.2.3 Delete Missing Values

Delete rows with missing values:

```python
# Delete all rows with at least one NaN
dataframe_cleaned = dataframe.dropna()

# Same result, with the default arguments spelled out
# (axis=0 drops rows, how='any' drops a row if any value is NaN)
dataframe_cleaned = dataframe.dropna(axis=0, how='any')

# Delete only the rows where TransactionNo is NaN
dataframe_cleaned = dataframe.dropna(subset=['TransactionNo'])
```
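How many rows survive depends heavily on the arguments passed to `dropna()`. The following sketch compares a few variants on a made-up toy frame; the `how='all'` call is included only for contrast.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1, 2 and 3 each contain exactly one NaN
toy = pd.DataFrame({
    "TransactionNo": ["A1", None, "A3", "A4"],
    "Price": [10.0, 11.0, np.nan, 12.0],
    "CustomerNo": [1.0, 2.0, 3.0, np.nan],
})

# Drop every row that has at least one NaN: only row 0 remains
any_nan = toy.dropna()

# Drop only rows whose TransactionNo is missing: row 1 is removed
subset_only = toy.dropna(subset=["TransactionNo"])

# Drop only rows where every value is NaN: nothing is removed here
all_nan = toy.dropna(how="all")

print(len(toy), len(any_nan), len(subset_only), len(all_nan))  # 4 1 3 4
```

Restricting the check with `subset` keeps far more data than the default when missing values are scattered across columns that are not needed for the analysis.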

#### 3.2.2.2.4 Impute Missing Values

Simple imputation methods:

```python
import pandas as pd

dataframe = pd.read_csv('./Sales Transaction v.4a.csv')

# Fill NaN with 0
zero_imputation = dataframe.fillna(0)

# Fill NaN with the mean value of each numeric column
# (numeric_only=True skips text columns, which have no mean)
mean_value = dataframe.mean(numeric_only=True)
mean_imputation = dataframe.fillna(mean_value)
```
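Mean imputation only applies to numeric columns, so text columns keep their gaps. One common option, shown here as a sketch on made-up toy data rather than as the method used in this chapter, is to fill numeric columns with their mean and a categorical column with its most frequent value.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two numeric columns and one text column
toy = pd.DataFrame({
    "Price": [10.0, np.nan, 14.0],
    "Quantity": [1.0, 3.0, np.nan],
    "Country": ["France", None, "France"],
})

imputed = toy.copy()

# Numeric columns: fill each gap with that column's mean
numeric_cols = toy.select_dtypes(include="number").columns
imputed[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Text column: fill gaps with the most frequent value (the mode)
imputed["Country"] = toy["Country"].fillna(toy["Country"].mode().iloc[0])

print(imputed)
```

Filling with a single statistic shrinks the variance of the column, which is one reason to consider the multiple imputation approach covered next.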

Multiple imputation using miceforest:

```python
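from miceforest import ImputationKernel

# Note: miceforest expects categorical columns to use the pandas 'category'
# dtype, so text (object) columns may need converting or dropping first.
# The save_all_iterations argument may be named differently in other
# miceforest releases; check the version installed.
mice_kernel = ImputationKernel(data=dataframe,
                               save_all_iterations=True,
                               random_state=42)

# Run 2 MICE iterations
mice_kernel.mice(2)

# Retrieve a copy of the data with missing values filled in
mice_imputation = mice_kernel.complete_data()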
```