# Data Preprocessing Tools
> Importing libraries <br>
> Importing Dataset <br>
> Taking care of Missing data <br>
> Encoding categorical data <br>

> > Encoding the independent variable <br>
> > Encoding the dependent variable <br>

> Splitting data into training and test set <br>
> Feature scaling

# Data preprocessing `methods` available in `Pandas`.

Pandas is a popular Python library for data manipulation and analysis. It provides a wide range of functions and methods for data preprocessing. Here are some of the commonly used data preprocessing functions available in Pandas:

1. **Loading Data:**
   - `pd.read_csv()`: Read data from CSV files.
   - `pd.read_excel()`: Read data from Excel files.
   - `pd.read_sql()`: Read data from SQL databases.
   - `pd.read_json()`: Read data from JSON files.

2. **Handling Missing Data:**
   - `df.dropna()`: Drop rows or columns with missing values.
   - `df.fillna()`: Fill missing values with specified values or methods.
   - `df.interpolate()`: Interpolate missing values based on linear or polynomial methods.

3. **Data Transformation:**
   - `df.apply()`: Apply a function along rows or columns.
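   - `Series.map()` / `DataFrame.map()`: Apply a function element-wise (`DataFrame.map()` requires pandas 2.1 or later; older releases use `DataFrame.applymap()`).
   - `df.replace()`: Replace values with other values.
   - `df.rename()`: Rename columns or index labels.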
   - `df.sort_values()`: Sort DataFrame by values.
   - `df.groupby()`: Group data by specified columns and apply aggregation functions.

4. **Categorical Data Handling:**
   - `pd.get_dummies()`: Create one-hot encoded columns for categorical variables.
   - `df.astype()`: Convert data types of columns.
   - `pd.cut()`: Bin continuous data into intervals (a top-level function, not a DataFrame method).

5. **String Manipulation:**
   - `df["col"].str.lower()`, `df["col"].str.upper()`: Convert strings to lowercase/uppercase (the `.str` accessor belongs to a Series, so select a column first).
   - `df["col"].str.strip()`: Remove leading and trailing whitespace.
   - `df["col"].str.replace()`: Replace substrings in strings.

6. **Datetime Handling:**
   - `pd.to_datetime()`: Convert strings to datetime objects.
   - `df.resample()`: Resample time series data to different frequencies.
   - `df.shift()`: Shift data along the time axis.

7. **Data Aggregation and Transformation:**
   - `df.groupby()`: Group data and perform aggregation operations.
   - `df.pivot_table()`: Create pivot tables for summarizing data.
   - `df.melt()`: Convert wide data to long format.

8. **Combining Data:**
   - `pd.concat()`: Concatenate DataFrames along rows or columns.
   - `pd.merge()`: Merge DataFrames based on specified keys.

9. **Feature Scaling and Normalization:**
   - These operations can be performed using mathematical operations and broadcasting on DataFrame columns.

10. **Dropping Columns:**
    - `df.drop()`: Drop specified columns.

11. **Dropping Duplicate Rows:**
    - `df.drop_duplicates()`: Remove duplicate rows.

12. **Indexing and Selection:**
    - Selecting rows and columns using indexing and boolean conditions.

13. **Data Visualization:**
    - Pandas can work well with other visualization libraries like Matplotlib and Seaborn for exploratory data analysis.

These are just a subset of the many functions and methods provided by Pandas for data preprocessing. Depending on your specific data and preprocessing needs, you can combine these functions to create powerful data transformation pipelines.
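
As a rough illustration of how several of the functions listed above fit together, here is a minimal sketch on a made-up DataFrame. The column names (`country`, `age`, `salary`), the bin edges, and the choice of min-max scaling are assumptions for the example only, not part of the dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with untidy strings and missing numeric values.
df = pd.DataFrame({
    "country": [" France", "spain ", "Germany", "Spain"],
    "age": [44.0, 27.0, np.nan, 38.0],
    "salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# String clean-up goes through the Series .str accessor.
df["country"] = df["country"].str.strip().str.title()

# Fill missing numeric values with each column's mean.
num_cols = ["age", "salary"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Bin ages into labelled intervals with the top-level pd.cut function.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 40, 120],
                         labels=["<=30", "31-40", ">40"])

# Min-max scale salary using plain arithmetic and broadcasting.
salary_range = df["salary"].max() - df["salary"].min()
df["salary_scaled"] = (df["salary"] - df["salary"].min()) / salary_range

# One-hot encode the cleaned country column.
encoded = pd.get_dummies(df, columns=["country"])
print(encoded)
```

Each step works on its own, so a real pipeline would keep only the steps its data needs.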

# Importing Libraries.

In [None]:
import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

# Importing Dataset

In [None]:
dataset = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Machine Learning A-Z (Codes and Datasets)/Part 1 - Data Preprocessing/Section 2 -------------------- Part 1 - Data Preprocessing --------------------/Python/Data.csv")
feature_matrix_x = dataset.iloc[:, :-1].values
dependent_variable_y = dataset.iloc[:, -1].values

In [None]:
# Printing values:
print(feature_matrix_x, dependent_variable_y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]] ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
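
The `iloc[:, :-1]` / `iloc[:, -1]` pattern used above splits a table into "every column except the last" and "the last column only". The sketch below shows the same split on a three-row excerpt read from an in-memory string. The header names (`Country`, `Age`, `Salary`, `Purchased`) are assumed for illustration, since the actual header of `Data.csv` is not shown here.

```python
import io
import pandas as pd

# Hypothetical excerpt; the empty Salary field becomes NaN when parsed.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,40,,Yes
"""
sample = pd.read_csv(io.StringIO(csv_text))

X = sample.iloc[:, :-1].values  # every column except the last (features)
y = sample.iloc[:, -1].values   # only the last column (target)

print(X.shape, y.shape)
print(sample.isna().sum())      # count of missing values per column
```

Checking `isna().sum()` right after loading is a quick way to see which columns will need imputation.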


## `Indexing and Selection`

Advanced indexing in NumPy and Pandas lets you select and modify data with boolean masks, integer arrays, and multi-dimensional or hierarchical keys instead of plain positional slices. The techniques below cover both libraries:

### NumPy:

1. **Boolean Indexing:**
   You can use boolean arrays to select elements that satisfy a certain condition.
   ```python
   import numpy as np

   arr = np.array([1, 2, 3, 4, 5])
   mask = arr > 2
   selected = arr[mask]  # [3, 4, 5]
   ```

2. **Fancy Indexing:**
   Using arrays of indices to select elements from an array.
   ```python
   arr = np.array([1, 2, 3, 4, 5])
   indices = np.array([0, 3])
   selected = arr[indices]  # [1, 4]
   ```

3. **Multi-dimensional Indexing:**
   Accessing elements in multi-dimensional arrays.
   ```python
   arr = np.array([[1, 2, 3], [4, 5, 6]])
   element = arr[1, 2]  # 6
   ```

### Pandas:

1. **Boolean Indexing:**
   Applying boolean masks to DataFrames to filter rows.
   ```python
   import pandas as pd

   df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
   mask = df['A'] > 1
   selected_rows = df[mask]
   ```

2. **Label-based Indexing:**
   Selecting rows and columns by labels using `.loc`.
   ```python
   selected = df.loc[1, 'B']
   selected_rows = df.loc[df['A'] > 1]
   ```

3. **Position-based Indexing:**
   Selecting rows and columns by integer positions using `.iloc`.
   ```python
   selected = df.iloc[0, 1]
   selected_rows = df.iloc[1:]
   ```

4. **Indexing with MultiIndex:**
   Dealing with multi-level indexing in DataFrames.
   ```python
   df = pd.DataFrame({'A': [1, 2, 3]}, index=[['X', 'X', 'Y'], [1, 2, 1]])
   selected = df.loc[('X', 1)]  # a tuple key makes it explicit that both values are row-index levels
   ```

5. **Using `loc` and `iloc` for Mixed Selection:**
   Combining label-based and position-based indexing for selection.
   ```python
   selected = df.loc['X'].iloc[0]
   ```

6. **Indexing and Selection with `.xs()`:**
   Cross-section selection from hierarchical indices.
   ```python
   xs_selected = df.xs('X', level=0)
   ```

Advanced indexing and selecting are powerful techniques that allow you to extract, modify, and analyze specific portions of data from arrays and DataFrames based on different conditions and positions. These methods are essential for performing more complex data manipulation and analysis tasks.
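
Selection is also the usual way to modify part of a DataFrame. The following sketch, on a made-up frame named `df_demo`, combines two conditions, tests membership with `isin()`, and assigns through `.loc` so that only the matching rows change.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = df_demo[(df_demo["A"] > 1) & (df_demo["B"] < 40)]

# isin() checks membership against a list of values.
picked = df_demo[df_demo["A"].isin([1, 4])]

# Assigning through .loc with a mask updates only the matching rows.
df_demo.loc[df_demo["A"] > 2, "B"] = 0

print(both)
print(picked)
print(df_demo)
```

Assigning through a single `.loc[mask, column]` call, rather than chaining two selections, avoids writing to a temporary copy.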

# Taking care of Missing data

In [None]:
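# Replace missing numeric values (NaN) with the column mean using scikit-learn's SimpleImputer.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="mean")
# Columns 1 and 2 are the numeric columns; column 0 holds strings and is left untouched.
feature_matrix_x[:, 1:3] = imputer.fit_transform(feature_matrix_x[:, 1:3])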

In [None]:
print(feature_matrix_x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
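
To see where `38.777...` and `63777.777...` come from, the sketch below recomputes the column means by hand with NumPy and compares the result with `SimpleImputer`. The array is copied from the two numeric columns printed earlier; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The two numeric columns of the dataset, including the two missing entries.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# Column means that ignore NaN, then substitute them where values are missing.
col_means = np.nanmean(numeric, axis=0)
manual = np.where(np.isnan(numeric), col_means, numeric)

# The same operation through scikit-learn.
imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer_demo.fit_transform(numeric)

print(col_means)
print(np.allclose(manual, imputed))
```

The means are 349 / 9 and 574000 / 9, computed over the nine known values in each column.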


# Encoding Categorical data.

## Encoding the Independent variable

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 and pass the remaining columns through unchanged.
cTransformer = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough")
feature_matrix_x = np.array(cTransformer.fit_transform(feature_matrix_x))

In [None]:
print(feature_matrix_x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
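
The order of the three new columns follows the sorted category names. A minimal sketch on a standalone column (names here are illustrative) shows how to inspect that mapping through the encoder's `categories_` attribute.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column, shaped (n_samples, 1) as the encoder expects.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder_demo = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder_demo.fit_transform(countries).toarray()

print(encoder_demo.categories_[0])  # categories in sorted order
print(one_hot)
```

With these inputs the categories sort to France, Germany, Spain, which is why a row for Spain has its `1` in the third column.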


## Encoding the dependent variable

In [None]:
from sklearn.preprocessing import LabelEncoder
Ltransformer = LabelEncoder()
dependent_variable_y = Ltransformer.fit_transform(dependent_variable_y)


In [None]:
print(dependent_variable_y)

[0 1 0 0 1 1 0 1 0 1]
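
`LabelEncoder` assigns integers in sorted order of the class names, so `'No'` becomes `0` and `'Yes'` becomes `1`. The sketch below, using an illustrative label array, shows how to read the mapping from `classes_` and how to recover the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])

label_encoder_demo = LabelEncoder()
encoded_labels = label_encoder_demo.fit_transform(labels)

print(label_encoder_demo.classes_)  # position in this array is the integer code
print(encoded_labels)
# inverse_transform maps the integer codes back to the original strings.
print(label_encoder_demo.inverse_transform(encoded_labels))
```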


# Splitting data into training & test set.

In [None]:
from sklearn.model_selection import train_test_split
feature_matrix_x_train, feature_matrix_x_test, dependent_variable_y_train, dependent_variable_y_test = train_test_split(
    feature_matrix_x, dependent_variable_y, test_size=0.2, random_state=1
)

In [None]:
print(f"xtrain:{feature_matrix_x_train}" )
print(f"xtest: {feature_matrix_x_test}")
print(f"ytrain:{dependent_variable_y_train}")
print(f"ytest:{dependent_variable_y_test}")

xtrain:[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
xtest: [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
ytrain:[0 1 0 0 1 1 0 1]
ytest:[0 1]
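
With ten rows and `test_size=0.2`, the split holds out two rows and trains on eight. Fixing `random_state` makes the shuffle repeatable. The sketch below checks both properties on placeholder arrays; `X_demo` and `y_demo` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 2 feature columns.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Two calls with the same random_state produce the same partition.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```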


# Feature Scaling

In [None]:
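from sklearn.preprocessing import StandardScaler
sScaler = StandardScaler()
# Fit on the training rows only, so that no test-set statistics leak into the scaler.
feature_matrix_x_train[:, 3:] = sScaler.fit_transform(feature_matrix_x_train[:, 3:])
# Reuse the training mean and standard deviation to scale the test rows.
feature_matrix_x_test[:, 3:] = sScaler.transform(feature_matrix_x_test[:, 3:])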
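
Standardization, as `StandardScaler` applies it, subtracts each column's mean and divides by its standard deviation, with both statistics learned from the training rows only. The sketch below reproduces that arithmetic by hand on a small made-up array and compares it with the scaler. It assumes the default settings, under which the scaler uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training and test rows with two numeric columns.
train = np.array([[27.0, 48000.0], [38.0, 61000.0], [44.0, 72000.0], [50.0, 83000.0]])
test = np.array([[30.0, 54000.0]])

# Learn the mean and standard deviation from the training rows only.
scaler_demo = StandardScaler().fit(train)

# Manual version: z = (x - mean) / std, using population std (ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std  # test rows reuse the training statistics

print(np.allclose(manual_train, scaler_demo.transform(train)))
print(np.allclose(manual_test, scaler_demo.transform(test)))
```

The one-hot columns are left out of the scaling step in the cell above because they already take only the values 0 and 1.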