# Data Preprocessing Tools

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd



## Importing the dataset

In [2]:
dataset = pd.read_csv("Data.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values 


### Understanding Dataset Slicing Notation

`.iloc` selects rows and columns of a pandas DataFrame by integer position, using the same **slicing** notation as NumPy arrays and Python lists. Let's break down the notation used above:

#### Basic Slicing Syntax: `[rows, columns]`

The general format is `dataset.iloc[row_selection, column_selection]` where:
- **First position (before comma)**: Selects rows
- **Second position (after comma)**: Selects columns

#### Row Selection (`:`)
- **`:`** means "select all rows"
- **`0:5`** would select rows 0 through 4 (5 is excluded)
- **`10:`** would select from row 10 to the end
- **`:20`** would select from the beginning up to row 19

#### Column Selection (`:-1`)
- **`:`** would select all columns
- **`:-1`** selects all columns **except the last one**
- **`:3`** would select columns 0, 1, and 2
- **`1:`** would select from column 1 to the end
- **`[0, 2, 4]`** would select specific columns by index

#### Why `:-1` for Features?
In machine learning datasets, the convention is often:
- **Features (input variables)**: All columns except the last
- **Target (output variable)**: The last column

So `x = dataset.iloc[:, :-1].values` means:
- **`:`** → Take all rows
- **`:-1`** → Take all columns except the last one
- **`.values`** → Convert the selection from a pandas DataFrame to a NumPy array (`.to_numpy()` is the equivalent that the pandas documentation now recommends)

And `y = dataset.iloc[:, -1].values` means:
- **`:`** → Take all rows  
- **`-1`** → Take only the last column (the target variable)
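One practical consequence of the two selections is their shape: a column slice such as `:-1` keeps two dimensions, while a single index such as `-1` collapses to one. The sketch below shows this on a small in-memory frame. The column names (`Country`, `Age`, `Salary`, `Purchased`) and the name `demo` are assumptions for illustration only, since the header of `Data.csv` is not shown; the three rows mirror the first rows printed for `x` and `y` later in this notebook.

```python
import pandas as pd

# Hypothetical column names; the values mirror the first three printed rows.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

x_demo = demo.iloc[:, :-1].values  # column slice -> 2-D result (rows, features)
y_demo = demo.iloc[:, -1].values   # single column -> 1-D result (rows,)

print(x_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
```

Because the feature columns mix text and numbers, the feature array typically ends up with a generic `object` dtype, which is why the printed `x` shows quoted strings next to floats.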

#### Negative Indexing
- **`-1`** refers to the last element
- **`-2`** refers to the second-to-last element
- **`:-1`** means "up to but not including the last element"
- **`-3:`** means "from the third-to-last element to the end"
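The same negative-index rules can be checked on a plain NumPy array. This is a minimal sketch; the array `arr` is made up for the example and is not part of the dataset.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])  # illustrative values only

print(arr[-1])   # 50             (last element)
print(arr[-2])   # 40             (second-to-last element)
print(arr[:-1])  # [10 20 30 40]  (everything except the last element)
print(arr[-3:])  # [30 40 50]     (third-to-last element to the end)
```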

#### Practical Examples of Slicing

Let's see some examples using our dataset to understand how different slicing patterns work:

In [None]:
# Let's first look at the dataset structure
print("Dataset shape:", dataset.shape)
print("\nDataset columns:", dataset.columns.tolist())
print("\nFirst 5 rows of the dataset:")
print(dataset.head())

In [None]:
# Example 1: Select all rows, all columns except the last (features)
features = dataset.iloc[:, :-1]
print("Features (all columns except last):")
print(features.head())
print(f"Shape: {features.shape}")

print("\n" + "="*50 + "\n")

# Example 2: Select all rows, only the last column (target)
target = dataset.iloc[:, -1]
print("Target (last column only):")
print(target.head())
print(f"Shape: {target.shape}")

print("\n" + "="*50 + "\n")

# Example 3: Select first 3 rows, all columns
first_three = dataset.iloc[:3, :]
print("First 3 rows, all columns:")
print(first_three)

print("\n" + "="*50 + "\n")

# Example 4: Select all rows, first 2 columns
first_two_cols = dataset.iloc[:, :2]
print("All rows, first 2 columns:")
print(first_two_cols.head())

print("\n" + "="*50 + "\n")

# Example 5: Select rows 2-4, columns 1-2
subset = dataset.iloc[2:5, 1:3]
print("Rows 2-4, columns 1-2:")
print(subset)

In [3]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [8]:
from sklearn.impute import SimpleImputer

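# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# The slice 1:3 covers both numeric columns (indices 1 and 2); 1:2 would cover only column 1
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])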

In [9]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
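The two filled-in values can be checked by hand. The sketch below is not how `SimpleImputer` is implemented; it only reproduces the arithmetic of the `"mean"` strategy with `np.nanmean`, using the two numeric columns exactly as they were printed for `x` before imputation. The variable names (`age`, `salary`) are assumptions, since the column headers are not shown.

```python
import numpy as np

# Numeric columns 1 and 2 of x as printed before imputation; np.nan marks the gaps.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# Mean of each column, ignoring the missing entries
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Fill each missing entry with its own column's mean
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(round(age_mean, 2))     # 38.78
print(round(salary_mean, 2))  # 63777.78
```

These match the values 38.77777777777778 and 63777.77777777778 that appear in the imputed output above.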


## Encoding categorical data

### Encoding the Independent Variable

### Encoding the Dependent Variable

## Splitting the dataset into the Training set and Test set

## Feature Scaling