# 📝 Complete Guide to Data Preparation A to Z

Welcome to the "Complete Guide to Data Preparation A to Z," your ultimate resource for mastering the critical techniques of data preparation in data science and analytics. This comprehensive guide is designed for anyone interested in ensuring their data is primed for analysis, from students just starting out in data science to seasoned analysts looking to refine their skills.

## What Will You Learn?

This guide covers a broad spectrum of topics crucial for effective data preparation, making sure you are well-equipped to handle any challenges in cleaning and organizing data. Here's a snapshot of what's included:

- **Data Cleaning Basics:** Learn the fundamental techniques for identifying and correcting inaccuracies, handling missing values, and dealing with outliers.
- **Data Transformation:** Explore methods for normalizing and standardizing data, converting data types, and creating new derived variables.
- **Data Reduction Techniques:** Understand how to simplify large datasets through dimensionality reduction techniques like PCA and feature selection.
- **Handling Categorical Data:** Master techniques for encoding categorical variables and managing categories with many levels.
- **Data Integrity Checks:** Discover how to implement checks to ensure the accuracy and consistency of your data throughout the analysis process.
- **Automating Data Preparation:** Learn about tools and scripts that can automate parts of the data preparation process, enhancing efficiency and reducing errors.
- **Advanced Data Cleaning:** Dive into more sophisticated data cleaning techniques, such as using algorithms to detect and correct data anomalies.

## Why This Guide?

- **Step-by-Step Instructions:** Each section is broken down into easy-to-follow steps, ensuring that you can apply the concepts you learn directly to your data.
- **Interactive Experience:** Engage with interactive code cells that let you see the effects of various data preparation techniques in real time.

Prepare to dive deep into the world of data preparation, enhancing your ability to clean, organize, and transform data into a powerful asset for any analysis or machine learning project. Let’s get started!


## Data Cleaning Basics

In this section, we explore basic data cleaning techniques using a synthetic dataset from the "Mythical Creatures and Oddities Zoo". The dataset is deliberately flawed: it contains common data issues such as inaccuracies (negative weights), missing values, and outliers, some of them entered as placeholder strings like 'MISSING' and 'OUTLIER' in numeric columns.

### Dataset Description
Our dataset includes various mythical creatures with their supposed weights, heights, ages, and other whimsical attributes. Some values are clearly exaggerated or incorrect, providing a great opportunity to practice cleaning data.

### Correcting Inaccuracies
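We begin by identifying and correcting entries that don't make sense, such as negative values for weight or impossible age values for mythical creatures. In the code for this section, the only correction of this kind is to negative weights, which are treated as sign errors and replaced by their absolute value.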
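
Before changing any values, it can help to flag the suspicious entries and look at them. The following is a minimal sketch on a hypothetical four-row sample (the `sample` frame and the `is_placeholder` / `is_negative` column names are illustrative assumptions, not part of the zoo dataset):

```python
import pandas as pd

# Hypothetical mini-sample mirroring the kinds of errors described above
sample = pd.DataFrame({
    "Species": ["Dragon", "Dragon", "Mermaid", "Phoenix"],
    "Weight (kg)": [4500, -3800, "MISSING", 12],
})

# Coerce to numbers; anything non-numeric becomes NaN
weight = pd.to_numeric(sample["Weight (kg)"], errors="coerce")

# Flag the two kinds of problems separately before correcting anything
is_placeholder = weight.isna() & sample["Weight (kg)"].notna()
is_negative = weight < 0

report = sample.assign(is_placeholder=is_placeholder, is_negative=is_negative)
print(report)
```

Keeping the flags as separate columns makes it easy to review which rows will be changed, and why, before applying a fix such as `.abs()`.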

### Handling Missing Values
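Next, we handle missing entries. Missing numeric values are recorded as placeholder strings, so each numeric column is first converted with `pd.to_numeric(..., errors='coerce')`, which turns those strings into NaN. For numeric fields like weight and height, we then use median imputation. For categorical fields like location, we use mode imputation; in this dataset the Location column has no missing entries, so that step is shown only to illustrate the pattern.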
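
As a small illustration of both imputation styles, here is a sketch on a hypothetical five-row sample. The `sample` frame and the `Height was missing` indicator column are assumptions added for the example; recording which cells were imputed is optional but often useful later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing height and one missing location
sample = pd.DataFrame({
    "Height (cm)": [350.0, 180.0, np.nan, 200.0, 45.0],
    "Location": ["Mountain Peak", "Ocean Abyss", "Ocean Abyss", None, "Sky Realm"],
})

# Record which cells were missing before filling them
sample["Height was missing"] = sample["Height (cm)"].isna()

# Median imputation for the numeric column (median ignores NaN by default)
height_median = sample["Height (cm)"].median()
sample["Height (cm)"] = sample["Height (cm)"].fillna(height_median)

# Mode imputation for the categorical column (most frequent value)
location_mode = sample["Location"].mode()[0]
sample["Location"] = sample["Location"].fillna(location_mode)

print(sample)
```

The median is preferred over the mean here because it is less sensitive to extreme values, which this dataset contains on purpose.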

### Dealing with Outliers
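Finally, we tackle outliers: entries that lie far from the rest of the data and could skew our analysis. We use the interquartile range (IQR) rule, which flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, and we replace the flagged values with the column median.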
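
The IQR rule can be sketched as a small helper. The function name `iqr_bounds`, its parameter `k`, and the `heights` series below are hypothetical choices for this example; 900 is a deliberately planted outlier.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical heights; 900 is a deliberately planted outlier
heights = pd.Series([150, 160, 165, 170, 175, 180, 185, 900], name="Height (cm)")

lower, upper = iqr_bounds(heights)
is_outlier = (heights < lower) | (heights > upper)

print(f"fences: [{lower}, {upper}]")
print(heights[is_outlier])
```

When a column mixes groups with very different typical values, such as species in the zoo dataset, fences computed over the whole column can flag an entire group. Computing the fences per group (for example with `groupby`) is one option worth considering in that case.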


In [1]:
import pandas as pd
import numpy as np

# Create the synthetic dataset
data = {
    'Creature ID': [f"{i:03}" for i in range(1, 21)],
    'Species': ['Dragon', 'Unicorn', 'Mermaid', 'Centaur', 'Phoenix', 'Dragon', 'Unicorn', 'Mermaid', 'Centaur', 'Phoenix', 
                'Dragon', 'Unicorn', 'Mermaid', 'Centaur', 'Phoenix', 'Dragon', 'Unicorn', 'Mermaid', 'Centaur', 'Phoenix'],
    'Weight (kg)': [4500, 500, 150, 350, 15, -3800, 490, 160, 340, 12, 4600, 510, 'MISSING', 360, 'MISSING', -4200, 530, 140, 355, 10],
    'Height (cm)': [350, 180, 160, 200, 45, 330, 175, 155, 205, 50, 370, 185, 165, 'OUTLIER', 40, 360, 170, 'MISSING', 210, 55],
    'Age (Years)': [305, 120, 80, 75, 500, 310, 115, 85, 70, 520, 300, 125, 78, 72, 510, 320, 110, 82, 74, 505],
    'Location': ['Mountain Peak', 'Enchanted Forest', 'Ocean Abyss', 'Ancient Greece', 'Sky Realm', 'Mountain Peak', 
                 'Enchanted Forest', 'Ocean Abyss', 'Ancient Greece', 'Sky Realm', 'Mountain Peak', 'Enchanted Forest', 
                 'Ocean Abyss', 'Ancient Greece', 'Sky Realm', 'Mountain Peak', 'Enchanted Forest', 'Ocean Abyss', 
                 'Ancient Greece', 'Sky Realm']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

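# Correcting inaccuracies: coerce placeholders such as 'MISSING' to NaN, then take
# the absolute value, on the assumption that negative weights are sign-entry errors
df['Weight (kg)'] = pd.to_numeric(df['Weight (kg)'], errors='coerce').abs()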
print("\nDataFrame after correcting negative weights:")
print(df)

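# Handling missing values: impute the median for numeric columns.
# Non-numeric placeholders ('MISSING', 'OUTLIER') are coerced to NaN first,
# so the 'OUTLIER' height is imputed here as a missing value.
numeric_columns = ['Weight (kg)', 'Height (cm)', 'Age (Years)']
for column in numeric_columns:
    df[column] = pd.to_numeric(df[column], errors='coerce')
    df[column] = df[column].fillna(df[column].median())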

print("\nDataFrame after handling missing numeric values:")
print(df)

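# Handling missing values in 'Location' with the most frequent value (mode).
# This dataset has no missing locations, so the fill leaves the column unchanged.
location_mode = df['Location'].mode()[0]
df['Location'] = df['Location'].fillna(location_mode)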
print("\nDataFrame after handling missing categorical values (Location):")
print(df)

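# Dealing with outliers using IQR for Height.
# Caveat: the fences are computed across all species, so with this data every
# Dragon and Phoenix height falls outside them and is replaced by the median.
Q1 = df['Height (cm)'].quantile(0.25)
Q3 = df['Height (cm)'].quantile(0.75)
IQR = Q3 - Q1
outlier_condition = (df['Height (cm)'] < (Q1 - 1.5 * IQR)) | (df['Height (cm)'] > (Q3 + 1.5 * IQR))
# Mark flagged values as missing, then fill them with the median of the rest
df.loc[outlier_condition, 'Height (cm)'] = np.nan
df['Height (cm)'] = df['Height (cm)'].fillna(df['Height (cm)'].median())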

print("\nDataFrame after handling outliers in Height:")
print(df)


Original DataFrame:
   Creature ID  Species Weight (kg) Height (cm)  Age (Years)          Location
0          001   Dragon        4500         350          305     Mountain Peak
1          002  Unicorn         500         180          120  Enchanted Forest
2          003  Mermaid         150         160           80       Ocean Abyss
3          004  Centaur         350         200           75    Ancient Greece
4          005  Phoenix          15          45          500         Sky Realm
5          006   Dragon       -3800         330          310     Mountain Peak
6          007  Unicorn         490         175          115  Enchanted Forest
7          008  Mermaid         160         155           85       Ocean Abyss
8          009  Centaur         340         205           70    Ancient Greece
9          010  Phoenix          12          50          520         Sky Realm
10         011   Dragon        4600         370          300     Mountain Peak
11         012  Unicorn         

## Note on Updates

This notebook is a work in progress and will be updated over time. Please check back regularly to see the latest additions and enhancements.