# Data Processing for Machine Learning: Data Wrangling and Preprocessing

## 1. Data Wrangling and Preprocessing


### What is Data Wrangling?

Data wrangling, a term often used interchangeably with data preprocessing, is the process of cleaning, transforming, and preparing raw data for analysis. In machine learning it matters because many algorithms expect complete, numeric, and consistently scaled inputs, so problems left in the raw data carry straight into the model.

### Common Steps in Data Wrangling:

1. **Handling Missing Data**: Replace, remove, or impute missing values.
2. **Handling Outliers**: Detect and address outliers that may distort the model.
3. **Data Transformation**: Normalize, scale, or encode categorical features.
4. **Feature Engineering**: Create new features or modify existing ones to improve model performance.
5. **Data Splitting**: Split the dataset into training, validation, and test sets.
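
Feature engineering (step 4) can be as simple as deriving a ratio or extracting parts of a timestamp. The following is a minimal sketch; the `orders` DataFrame and its column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical raw data; the column names are illustrative only.
orders = pd.DataFrame({
    "total_price": [20.0, 45.0, 9.0],
    "quantity": [2, 3, 1],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-17", "2024-03-30"]),
})

# Ratio feature: price per item.
orders["unit_price"] = orders["total_price"] / orders["quantity"]

# Date-part features extracted from the timestamp column.
orders["order_month"] = orders["order_date"].dt.month
orders["order_dayofweek"] = orders["order_date"].dt.dayofweek  # Monday = 0

print(orders)
```

Whether a derived feature actually helps depends on the model and the data, so it is worth checking each one against validation performance.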

### Example: Handling Missing Values

We can use techniques like imputation to fill in missing values or drop rows/columns with missing data.
    

In [None]:

import pandas as pd
import numpy as np

# Example: Creating a DataFrame with missing values
data = {'Feature1': [1, 2, np.nan, 4, 5],
        'Feature2': [10, 20, 30, np.nan, 50]}
df = pd.DataFrame(data)

# Handling missing values by filling each NaN with the mean of its own column
df_filled = df.fillna(df.mean())
df_filled
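
Mean imputation is only one of the options mentioned above. As a rough sketch of two alternatives, the block below drops incomplete rows with `dropna` and, separately, fills gaps with the column median using scikit-learn's `SimpleImputer`. It rebuilds the same small DataFrame so that it runs on its own; the names `df_dropped` and `df_median` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Feature1": [1, 2, np.nan, 4, 5],
                   "Feature2": [10, 20, 30, np.nan, 50]})

# Option 1: drop every row that contains at least one missing value.
df_dropped = df.dropna()

# Option 2: impute with the column median, which is less sensitive
# to extreme values than the mean.
imputer = SimpleImputer(strategy="median")
df_median = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_dropped)
print(df_median)
```

Dropping rows is the simplest choice but discards data (two of the five rows here), so imputation is usually preferable when the dataset is small.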
    


### Handling Outliers

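Outliers are extreme values that deviate significantly from the rest of the data. Two common statistical detection methods are the Z-score, which measures how many standard deviations a value lies from the mean, and the IQR (Interquartile Range) rule, which flags values lying more than 1.5 × IQR below the first quartile or above the third quartile.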

### Example: Detecting Outliers Using Z-Score
    

In [None]:

from scipy import stats

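# Example: Detecting outliers using Z-score
data_outliers = np.array([1, 2, 3, 4, 5, 100])
z_scores = np.abs(stats.zscore(data_outliers))
# A threshold of 3 is common for large samples, but with only six points
# the largest possible |z| is sqrt(n - 1) ≈ 2.24, so 2 is used here.
outliers = np.where(z_scores > 2)  # tuple holding the indices of flagged values
outliers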
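
The IQR rule can be sketched on the same array. The block below is a minimal illustration, assuming the conventional 1.5 × IQR fences; it also shows one possible way to address the flagged values, by clipping them to the fences. The variable names are introduced here and are not part of the original example.

```python
import numpy as np

data_outliers = np.array([1, 2, 3, 4, 5, 100])

# Quartiles and interquartile range.
q1, q3 = np.percentile(data_outliers, [25, 75])
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of values outside the fences.
outlier_mask = (data_outliers < lower) | (data_outliers > upper)
outlier_values = data_outliers[outlier_mask]

# One way to address outliers: clip them to the fences.
clipped = np.clip(data_outliers, lower, upper)

print(outlier_values, clipped)
```

Unlike the Z-score, the IQR rule relies on quartiles rather than the mean and standard deviation, so a single extreme value has less influence on the fences themselves.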
    


### Data Transformation

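Data transformation is often required to scale numerical features or encode categorical variables. Min-max normalization rescales each feature to the range 0 to 1 by computing (x - min) / (max - min), while standardization subtracts the mean and divides by the standard deviation, giving each feature a mean of 0 and a standard deviation of 1.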

### Example: Normalization and Standardization
    

In [None]:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

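# Example: Normalizing and Standardizing data
# Note: the scalers are fitted on the full dataset here for brevity; in a real
# workflow, fit them on the training split only to avoid data leakage.
scaler = MinMaxScaler()  # rescales each column to [0, 1]
data_normalized = scaler.fit_transform(df_filled)

scaler_standard = StandardScaler()  # rescales each column to mean 0, unit variance
data_standardized = scaler_standard.fit_transform(df_filled)

data_normalized, data_standardized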
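
Encoding categorical variables, the other transformation mentioned above, can be sketched with one-hot encoding via `pandas.get_dummies`. The `Color` column and its values are hypothetical, since the example dataset in this section contains only numeric features.

```python
import pandas as pd

# Hypothetical categorical column; the values are illustrative only.
df_cat = pd.DataFrame({"Color": ["red", "green", "blue", "green"],
                       "Feature1": [1.0, 2.0, 3.0, 4.0]})

# One-hot encode: each category becomes its own 0/1 indicator column.
df_encoded = pd.get_dummies(df_cat, columns=["Color"], dtype=int)

print(df_encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per distinct value.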
    


### Data Splitting

It is crucial to split your dataset into training, validation, and test sets. The training set is used to train the model, the validation set is used for model tuning, and the test set is used to evaluate final model performance.

### Example: Splitting Data into Train and Test Sets
    

In [None]:

from sklearn.model_selection import train_test_split

# Example: Splitting data into training and test sets
X = df_filled[['Feature1', 'Feature2']]
y = np.array([1, 0, 1, 0, 1])  # Example labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test
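
A three-way split into training, validation, and test sets can be obtained by calling `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses synthetic data, since five rows are too few to divide meaningfully. It also fits a `StandardScaler` on the training set only and reuses it for the other two sets, which is one common way to keep information from the held-out data out of preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only: 100 samples, 2 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

# First hold out 20% of the data as the test set...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# ...then take 25% of the remaining 80% (20% of the total) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

# Fit the scaler on the training set only, then apply it to all three sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The same pattern applies to imputers and encoders: fit on the training data, then transform the validation and test data with the fitted object.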
    


### Applications in Machine Learning

- **Data Wrangling** is essential for cleaning and preparing data before model training.
- Proper **handling of missing data** and **outliers** can significantly improve model performance.
- **Data transformation** ensures features are in the right format for machine learning models.
- **Data splitting** helps assess model performance on unseen data.

    