## Data Preprocessing
Data preprocessing is a crucial step in the data analysis and machine learning pipeline. It cleans and transforms raw data into a format suitable for analysis or as input to a machine learning model. Its main goals are to improve data quality and, in turn, the performance of the models trained on that data. The right preprocessing techniques depend on the nature of the data: each data type poses its own challenges and calls for different methods.

## Data Types

The columns in a Pandas DataFrame can contain different types of data. Here are some common types you might encounter:

1. **Numerical Data:**
   - **Discrete:** These are numerical data that have a countable number of distinct values. For example, the **number of cars** in a parking lot or the **number of students in a classroom**.
   - **Continuous:** These can take any numeric value within a range, so they have an infinite number of possible values. Examples include **height, weight, or temperature**.

2. **Categorical Data:**
   - **Nominal:** This type of data represents categories without any inherent order or ranking. Examples include **gender, color, or types of fruit**.
   - **Ordinal:** These have categories with a meaningful order. Examples include **socioeconomic status (low income, middle income, high income)** and **education level (high school, BS, MS, PhD)**.

3. **Datetime:**
   - **Datetime (datetime64):** Datetime data, also referred to as timestamp data, represents dates and times and is the basis of time series data.

4. **Sparse Data:**
    - Sparse data refers to data where a large proportion of the elements have a value of zero. This type of data is common in fields such as natural language processing, recommendation systems, and network analysis. In the example below, rows represent users and columns represent movies. Each entry is the rating a user gave a movie, and 0 means the user has not rated it.

    ```
    User       Movie A   Movie B   Movie C   Movie D   Movie E
    User 1        4         0         0         0         0
    User 2        0         0         0         5         0
    User 3        0         0         3         0         0
    User 4        0         0         0         0         2
    User 5        0         1         0         0         0
    ```

    - Sparse data often requires specialized techniques and algorithms to handle and analyze efficiently, because processing all the zero values can be computationally expensive and adds no meaningful information.

    - The `scipy.sparse` module provides a variety of sparse matrix types and operations for sparse matrix manipulations. It includes formats such as CSR (Compressed Sparse Row), CSC (Compressed Sparse Column), COO (Coordinate), and others.
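
As a minimal sketch of the idea above, the ratings table can be stored in CSR format so that only the nonzero entries are kept. The array values are copied from the table; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse

# Ratings from the table above: rows are users, columns are movies A-E.
# A 0 means the user has not rated the movie.
ratings = np.array([
    [4, 0, 0, 0, 0],
    [0, 0, 0, 5, 0],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 2],
    [0, 1, 0, 0, 0],
])

# CSR stores only the nonzero values plus their column indices and row pointers.
ratings_csr = sparse.csr_matrix(ratings)

print("Shape:", ratings_csr.shape)
print("Stored (nonzero) entries:", ratings_csr.nnz)
print("Density:", ratings_csr.nnz / (ratings.shape[0] * ratings.shape[1]))

# Converting back to a dense array recovers the original table.
assert (ratings_csr.toarray() == ratings).all()
```

Only 5 of the 25 cells are stored here; on a realistic user-by-movie matrix the saving is far larger.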


In [18]:
import pandas as pd
url = 'https://drive.google.com/file/d/19aYZVyCsbKp0UEQl8QQagKyHFmromwQg/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
df = pd.read_csv(path)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,two
1,10.34,1.66,Male,No,Sun,Dinner,three
2,21.01,3.5,Male,No,Sun,Dinner,three
3,23.68,3.31,Male,No,Sun,Dinner,two
4,24.59,3.61,Female,No,Sun,Dinner,four


1. **total_bill (Numeric):** Represents the total bill amount for a meal, usually a float.

2. **tip (Numeric):** Represents the tip amount given by the customer, usually a float.

3. **sex (Categorical):** Represents the gender of the person paying the bill, often categorized as "Male" or "Female."

4. **smoker (Categorical):** Indicates whether the party was a smoker or non-smoker, often categorized as "Yes" or "No."

5. **day (Categorical):** Represents the day of the week when the meal took place, categorized as "Thur," "Fri," "Sat," or "Sun."

6. **time (Categorical):** Indicates whether the meal was lunch or dinner.

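7. **size (Categorical):** Represents the size of the dining party. In this file it is stored as words (for example, "two" or "three") rather than numbers, so pandas reads it as text.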
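
To check how pandas actually stores these columns, a short sketch like the following can help. It rebuilds the five rows shown above by hand, so it runs without the download. The `size_map` dictionary and the `size_num` column are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# The five rows shown above, rebuilt by hand.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": ["two", "three", "three", "two", "four"],
})

# Numeric columns are float64; text columns use a string/object dtype.
print(tips.dtypes)

# Mark a text column as categorical explicitly.
tips["day"] = tips["day"].astype("category")

# Hypothetical mapping: turn the party size words into integers.
size_map = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
tips["size_num"] = tips["size"].map(size_map)

print(tips[["day", "size", "size_num"]])
```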


## Handling Missing Values

Handling missing values is crucial in data science for several reasons:

1. **Data Quality and Accuracy:** Missing values can introduce inaccuracies in the analysis, leading to incorrect conclusions. By addressing missing data, you improve the overall quality and accuracy of your dataset.

2. **Statistical Power:** Missing data can reduce the statistical power of your analysis, making it challenging to draw meaningful conclusions. Proper handling of missing values ensures that your statistical tests are more robust and reliable.

3. **Model Performance:** Many machine learning algorithms cannot handle missing values directly. Fitting models with missing data may lead to biased or inaccurate predictions.

4. **Avoiding Biases:** If the missing data is not handled properly, it may introduce bias into the analysis. This bias can impact the generalizability of your findings and lead to incorrect interpretations.

5. **Data Visualization:** Missing values can affect the visual representation of data, making it challenging to create accurate charts and graphs. Addressing missing data improves the clarity and reliability of data visualizations.

In [21]:
import pandas as pd
import numpy as np

# Sample DataFrame with missing values (you can also use the data from the cell above)
data = {
    'Name': ['Alice', 'Bob', np.nan, 'David', 'Emma'],
    'Age': [25, np.nan, 22, 35, 28],
    'Score': [85, 92, np.nan, 95, 89],
    'City': ['New York', 'San Francisco', 'Los Angeles', np.nan, 'Boston']
}

df = pd.DataFrame(data)
print("Sample DataFrame with Missing Values:\n", df)
# 1. Detecting Missing Data:
# The isnull() method can be used to detect missing values in a DataFrame.
# It returns a DataFrame of the same shape, where each element is True or False
# based on whether the corresponding element in the original DataFrame is missing.
missing_values = df.isnull()
print("\n1. Detect missing values:\n", missing_values)

# Count the number of missing values in each column
number_of_missing_values = df.isnull().sum()
print("\nNumber of missing values in each column:\n", number_of_missing_values)

# 2. Dropping Missing Values:
# The dropna() method can be used to remove rows or columns containing missing values.
# The axis parameter specifies whether to drop rows (axis=0) or columns (axis=1).

# Drop rows containing missing values
df_no_missing_rows = df.dropna()
print("\n2. DataFrame after dropping rows with missing values:\n", df_no_missing_rows)

# Drop columns containing missing values
df_no_missing_cols = df.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:\n", df_no_missing_cols)

# 3. Filling Missing Values:
# The fillna() method can be used to fill missing values with a specific value or with per-column values (e.g., column means).

# Fill missing values with a specific value (e.g., 0)
df_filled = df.fillna(0)
print("\n3. DataFrame after filling missing values with 0:\n", df_filled)

# Fill missing values in numerical columns with the column mean; missing values in non-numerical columns stay NaN
df_mean_filled = df.fillna(df.mean(numeric_only=True))
print("\nDataFrame after filling missing values with the mean:\n", df_mean_filled)

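# 4. Interpolation:
# The interpolate() method performs linear interpolation by default, filling missing values based on the values around them.
# Text (object) columns cannot be interpolated, and recent pandas versions warn or raise on them,
# so only the numerical columns are interpolated here.

# Interpolate missing values in the numerical columns
numeric_cols = df.select_dtypes(include='number').columns
df_interpolated = df.copy()
df_interpolated[numeric_cols] = df_interpolated[numeric_cols].interpolate()
print("\n4. DataFrame after interpolating missing values:\n", df_interpolated)
# For example, the output shows row 1 as: Bob  23.5  92.0  San Francisco. Here 23.5 comes from (25+22)/2, where 25 is from row 0 and 22 is from row 2.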

Sample DataFrame with Missing Values:
     Name   Age  Score           City
0  Alice  25.0   85.0       New York
1    Bob   NaN   92.0  San Francisco
2    NaN  22.0    NaN    Los Angeles
3  David  35.0   95.0            NaN
4   Emma  28.0   89.0         Boston

1. Detect missing values:
     Name    Age  Score   City
0  False  False  False  False
1  False   True  False  False
2   True  False   True  False
3  False  False  False   True
4  False  False  False  False

Number of missing values in each column:
 Name     1
Age      1
Score    1
City     1
dtype: int64

2. DataFrame after dropping rows with missing values:
     Name   Age  Score      City
0  Alice  25.0   85.0  New York
4   Emma  28.0   89.0    Boston

DataFrame after dropping columns with missing values:
 Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

3. DataFrame after filling missing values with 0:
     Name   Age  Score           City
0  Alice  25.0   85.0       New York
1    Bob   0.0   92.0  San Francisco
2      0 

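
Because a column mean is undefined for text columns, one option is to pick the fill value by column type. The sketch below rebuilds the sample data from the cell above and fills numerical columns with the median and text columns with a placeholder. Both choices (the median and the `"Unknown"` placeholder) are assumptions for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", np.nan, "David", "Emma"],
    "Age": [25, np.nan, 22, 35, 28],
    "Score": [85, 92, np.nan, 95, 89],
    "City": ["New York", "San Francisco", "Los Angeles", np.nan, "Boston"],
})

# Build one fill value per column: median for numeric, a placeholder for text.
fill_values = {}
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        fill_values[col] = df[col].median()
    else:
        fill_values[col] = "Unknown"

# fillna accepts a dict that maps column names to fill values.
df_imputed = df.fillna(fill_values)
print(df_imputed)
```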


## Encoding Categorical Variables

Encoding, in the context of data science and machine learning, is the process of converting data from one representation to another so that it suits a specific purpose or meets the requirements of a particular algorithm. It is most often applied to categorical variables and text data, which must be turned into numbers before most machine learning models can use them, and is therefore a crucial step in data preprocessing.

The most common categorical data encoding techniques are:

1. **Label Encoding:**

    - Assigns a unique integer to each category.
    - Suitable for ordinal data where there is a natural order among categories, provided the integers are assigned in that order.
    - Not recommended for nominal data as it may imply misleading relationships.

2. **One-Hot Encoding:**

    - Creates binary columns for each category (0 or 1).
    - Suitable for nominal data without any natural order.
    - Increases dimensionality but avoids false ordinal relationships.
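
The same two techniques can be sketched with scikit-learn's `OrdinalEncoder` and `OneHotEncoder`. The category order passed to `OrdinalEncoder` is an assumption (High School < Bachelor < Master < PhD), and the sample values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "Education": ["High School", "Bachelor", "Master", "PhD"],
    "Color": ["Red", "Blue", "Green", "Blue"],
})

# Ordinal data: pass the category order explicitly so the integers follow it.
education_order = ["High School", "Bachelor", "Master", "PhD"]
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_codes = ordinal_encoder.fit_transform(df[["Education"]]).ravel()
print(education_codes)

# Nominal data: one binary column per category.
onehot_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
color_onehot = onehot_encoder.fit_transform(df[["Color"]]).toarray()
print(onehot_encoder.categories_[0])
print(color_onehot)
```

Unlike a hand-written mapping, a fitted encoder can be reused to transform new data with exactly the same category-to-number assignment.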

In [38]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Example 1: Label Encoding for Natural Order Data
# Natural order example: Education levels (Ordinal data)
data_ordinal = {
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
}

df_ordinal = pd.DataFrame(data_ordinal)
print("Original Data (Ordinal):\n", df_ordinal)

# Label Encoding: assigns a unique integer to each category.
# The intended order here is: High School < Bachelor < Master < PhD.
# Note: LabelEncoder does not take an order; it sorts the categories alphabetically.
label_encoder = LabelEncoder()
df_ordinal['Education_LabelEncoded'] = label_encoder.fit_transform(df_ordinal['Education'])
print("\nLabel Encoded Data (Ordinal with Natural Order):\n", df_ordinal)

# The issue with LabelEncoder is that it does not know the order of your categories;
# it assigns numbers based on their alphabetical order.
# As a result, the order "High School < Bachelor < Master < PhD" is not preserved
# (Bachelor = 0, High School = 1, Master = 2, PhD = 3).


encoding_map = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df_ordinal['Education_LabelEncoded'] = df_ordinal['Education'].map(encoding_map)

print("\nLabel Encoded Data (Correct Order for Ordinal Data):\n", df_ordinal)



# Example 2: One-Hot Encoding for Nominal Data without Natural Order
# Nominal data example: Colors (no natural order)
data_nominal = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green']
}

df_nominal = pd.DataFrame(data_nominal)
print("\nOriginal Data (Nominal):\n", df_nominal)

# One-Hot Encoding: Creates binary columns for each category (0 or 1)
one_hot_encoded = pd.get_dummies(df_nominal, columns=['Color'])
#get_dummies() converts categorical variables into a series of binary (0 or 1) columns. Each column represents a unique category from the original column.
print("\nOne-Hot Encoded Data (Nominal without Natural Order) True/False:\n", one_hot_encoded)

one_hot_encoded = pd.get_dummies(df_nominal, columns=['Color']).astype(int)
print("\nOne-Hot Encoded Data (Nominal without Natural Order) 0, 1:\n", one_hot_encoded)



# Apply get_dummies with drop_first=True to drop the first dummy column

# With 3 categories we can drop the first column (Color_Blue): if Color_Green and Color_Red are both 0, the color must be Blue
one_hot_encoded_df = pd.get_dummies(df_nominal, columns=['Color'], drop_first=True).astype(int)
print("\nOne-Hot Encoded Data with drop_first=True:\n", one_hot_encoded_df)
# drop_first=False (default): Keeps all the dummy variables, one for each category.
# drop_first=True: Drops the first category to avoid multicollinearity when the encoded data will be used in regression models.
# This is useful when you need to reduce redundancy and avoid perfectly collinear variables.

Original Data (Ordinal):
      Education
0  High School
1     Bachelor
2       Master
3          PhD
4     Bachelor
5       Master

Label Encoded Data (Ordinal with Natural Order):
      Education  Education_LabelEncoded
0  High School                       1
1     Bachelor                       0
2       Master                       2
3          PhD                       3
4     Bachelor                       0
5       Master                       2

Label Encoded Data (Correct Order for Ordinal Data):
      Education  Education_LabelEncoded
0  High School                       0
1     Bachelor                       1
2       Master                       2
3          PhD                       3
4     Bachelor                       1
5       Master                       2

Original Data (Nominal):
    Color
0    Red
1   Blue
2  Green
3   Blue
4    Red
5  Green

One-Hot Encoded Data (Nominal without Natural Order) True/False:
    Color_Blue  Color_Green  Color_Red
0       False        F

## Feature Scaling

Feature scaling is a data preprocessing technique that rescales the values of features (or attributes) in a dataset to a common range. The primary goal is to ensure that all features have similar scales, which can improve the performance of many machine learning algorithms. It is particularly important for algorithms that are sensitive to the scale of input features, such as gradient descent-based optimization methods (e.g., in neural networks) and distance-based algorithms (e.g., k-nearest neighbors or support vector machines).


1. Min-Max Scaling (Normalization):
   - Scales each feature to a given range (default is between 0 and 1).
   - This is useful for normalizing features that need to fit within a bounded interval.
   - The formula for min-max scaling is:
     ```
     X_normalized = (X - X_min) / (X_max - X_min)
     ```
   Here, X is the original feature value, X_normalized is the normalized value, X_min is the minimum value in the feature, and X_max is the maximum value in the feature.

    **The function below accepts a Pandas series as a parameter and returns the scaled data:**
    ```python
    def minmax_scale(column):
        min_val = column.min()
        max_val = column.max()
        scaled_column = (column - min_val) / (max_val - min_val)
        return scaled_column
    ```
   

2. Standardization:
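   - Also known as z-score standardization.
   - Scales features to have a mean of 0 and a standard deviation of 1.
   - Commonly used for algorithms that work best when features are centered and on comparable scales, such as SVMs and logistic regression. It does not change the shape of the distribution, so it does not make the data normally distributed.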
   - The formula for standardization is:
     ```
     X_standardized = (X - mean) / standard deviation
     ```
   - Here, X is the original feature value, X_standardized is the standardized value, and the mean and standard deviation are computed over all values of the feature.
   
    **The function below accepts a Pandas series as a parameter and returns the scaled data:**
   ```python
    def zscore_standardize(column):
        mean_val = column.mean()
        std_dev = column.std()
        standardized_column = (column - mean_val) / std_dev
        return standardized_column
   ```

3. Robust Scaling:
   - Scales features using statistics that are robust to outliers (median and IQR). The Interquartile Range (IQR) is a measure of statistical dispersion and is defined as the range between the first quartile (25th percentile, Q1) and the third quartile (75th percentile, Q3) of a dataset: IQR = Q3 - Q1.
   - Useful when data contains many outliers or is not normally distributed.
   - The formula for robust scaling is:
     ```
     X_robust = (X - X_median) / (Q3 - Q1)
     ```
   Here, X is the original feature value, X_robust is the robust-scaled value, Q1 is the first quartile, and Q3 is the third quartile of the feature values.

   **The function below accepts a Pandas series as a parameter and returns the scaled data:**

   ```python
   def robust_scale(column):
       median_val = column.median()
       iqr = column.quantile(0.75) - column.quantile(0.25)
       scaled_column = (column - median_val) / iqr
       return scaled_column
   ```
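
To see why robust scaling helps with outliers, the sketch below applies the min-max and robust formulas above to a small made-up series with one extreme value.

```python
import pandas as pd

# Made-up values: four typical points and one outlier.
values = pd.Series([1, 2, 3, 4, 100])

# Min-max scaling: the outlier defines the range, so the other points bunch up near 0.
minmax = (values - values.min()) / (values.max() - values.min())

# Robust scaling: median and IQR are barely affected by the single outlier.
iqr = values.quantile(0.75) - values.quantile(0.25)
robust = (values - values.median()) / iqr

print(pd.DataFrame({"value": values, "minmax": minmax, "robust": robust}))
```

After min-max scaling the four typical points are squeezed into roughly the lowest 3% of the range, while after robust scaling they still span 1.5 units (from -1.0 to 0.5).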

In [39]:
import pandas as pd

# Sample DataFrame with some numeric data
data = {
    'Feature1': [100, 150, 200, 250, 300],
    'Feature2': [1, 2, 3, 4, 5],
    'Feature3': [10, 100, 1000, 10000, 100000]  # This column has a large range, showing how scaling affects it
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Min-Max Scaling Function
def minmax_scale(column):
    min_val = column.min()
    max_val = column.max()
    scaled_column = (column - min_val) / (max_val - min_val)
    return scaled_column

# Standardization (Z-score) Function
def zscore_standardize(column):
    mean_val = column.mean()
    std_dev = column.std()  # pandas uses the sample standard deviation (ddof=1) by default
    standardized_column = (column - mean_val) / std_dev
    return standardized_column

# Robust Scaling Function
def robust_scale(column):
    median_val = column.median()
    iqr = column.quantile(0.75) - column.quantile(0.25)
    scaled_column = (column - median_val) / iqr
    return scaled_column

# Applying the scaling functions to the DataFrame columns
df['Feature1_MinMax'] = minmax_scale(df['Feature1'])
df['Feature1_Standardized'] = zscore_standardize(df['Feature1'])
df['Feature1_Robust'] = robust_scale(df['Feature1'])

df['Feature2_MinMax'] = minmax_scale(df['Feature2'])
df['Feature2_Standardized'] = zscore_standardize(df['Feature2'])
df['Feature2_Robust'] = robust_scale(df['Feature2'])

df['Feature3_MinMax'] = minmax_scale(df['Feature3'])
df['Feature3_Standardized'] = zscore_standardize(df['Feature3'])
df['Feature3_Robust'] = robust_scale(df['Feature3'])

# Display the DataFrame with scaled features
print("\n DataFrame with Scaled Features:\n", df)


Original DataFrame:
    Feature1  Feature2  Feature3
0       100         1        10
1       150         2       100
2       200         3      1000
3       250         4     10000
4       300         5    100000

DataFrame with Scaled Features:
    Feature1  Feature2  Feature3  Feature1_MinMax  Feature1_Standardized  \
0       100         1        10             0.00              -1.264911   
1       150         2       100             0.25              -0.632456   
2       200         3      1000             0.50               0.000000   
3       250         4     10000             0.75               0.632456   
4       300         5    100000             1.00               1.264911   

   Feature1_Robust  Feature2_MinMax  Feature2_Standardized  Feature2_Robust  \
0             -1.0             0.00              -1.264911             -1.0   
1             -0.5             0.25              -0.632456             -0.5   
2              0.0             0.50               0.000000       

Instead of manually implementing feature scaling functions, we can use built-in functions from libraries like scikit-learn, which provides efficient and easy-to-use scaling functions for Min-Max Scaling, Standardization, and Robust Scaling.
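
One detail to keep in mind when comparing the two approaches: pandas' `Series.std()` computes the sample standard deviation (`ddof=1`) by default, while `StandardScaler` uses the population standard deviation (`ddof=0`), so the standardized values differ slightly. A small check, using the `Feature1` values from the example (the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

feature = pd.Series([100, 150, 200, 250, 300], dtype=float)

# Manual z-score with pandas defaults (sample standard deviation, ddof=1).
z_sample = (feature - feature.mean()) / feature.std()

# Manual z-score with the population standard deviation (ddof=0).
z_population = (feature - feature.mean()) / feature.std(ddof=0)

# StandardScaler expects a 2D input, hence to_frame().
z_sklearn = StandardScaler().fit_transform(feature.to_frame()).ravel()

print(z_sample.round(6).tolist())
print(z_population.round(6).tolist())
print(z_sklearn.round(6).tolist())
```

The first value is about -1.264911 with `ddof=1` and about -1.414214 with `ddof=0`; the scikit-learn result matches the latter.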

In [41]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Sample DataFrame  with some numeric data
data = {
    'Feature1': [100, 150, 200, 250, 300],
    'Feature2': [1, 2, 3, 4, 5],
    'Feature3': [10, 100, 1000, 10000, 100000]  # This column has a large range, showing how scaling affects it
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Initializing the scalers
minmax_scaler = MinMaxScaler()
standard_scaler = StandardScaler()
robust_scaler = RobustScaler()

# Applying Min-Max Scaling
df_minmax_scaled = pd.DataFrame(minmax_scaler.fit_transform(df), columns=df.columns)
df_minmax_scaled.columns = [f"{col}_MinMax" for col in df_minmax_scaled.columns]

# Applying Standardization (Z-score Scaling)
df_standard_scaled = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)
df_standard_scaled.columns = [f"{col}_Standardized" for col in df_standard_scaled.columns]

# Applying Robust Scaling
df_robust_scaled = pd.DataFrame(robust_scaler.fit_transform(df), columns=df.columns)
df_robust_scaled.columns = [f"{col}_Robust" for col in df_robust_scaled.columns]

# Concatenating the results into one DataFrame for comparison
df_scaled = pd.concat([df, df_minmax_scaled, df_standard_scaled, df_robust_scaled], axis=1)

# Display the DataFrame with scaled features
print("\nDataFrame with Scaled Features:\n", df_scaled)


Original DataFrame:
    Feature1  Feature2  Feature3
0       100         1        10
1       150         2       100
2       200         3      1000
3       250         4     10000
4       300         5    100000

DataFrame with Scaled Features:
    Feature1  Feature2  Feature3  Feature1_MinMax  Feature2_MinMax  \
0       100         1        10             0.00             0.00   
1       150         2       100             0.25             0.25   
2       200         3      1000             0.50             0.50   
3       250         4     10000             0.75             0.75   
4       300         5    100000             1.00             1.00   

   Feature3_MinMax  Feature1_Standardized  Feature2_Standardized  \
0         0.000000              -1.414214              -1.414214   
1         0.000900              -0.707107              -0.707107   
2         0.009901               0.000000               0.000000   
3         0.099910               0.707107               0.707107  