<a href="https://colab.research.google.com/github/kusheshgangwar/Data-preprocessing/blob/main/DATA_PREPROCESSING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [5]:
# Load the dataset (replace "Data.csv" with the path to your own file)
df = pd.read_csv("Data.csv")  # read_csv already returns a DataFrame
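
If the CSV stores amounts as currency-formatted text (for example `$2,155.00` or `$-`), pandas loads those columns as strings, and the numeric-only steps that follow will skip them. The sketch below shows one way to convert such columns to floats. The helper name `parse_currency` and the sample values are hypothetical, and treating a bare `$-` as zero is an assumption about how the file was exported.

```python
import pandas as pd


def parse_currency(series: pd.Series) -> pd.Series:
    """Convert strings such as ' $7,54,250.00 ' to floats (hypothetical helper)."""
    cleaned = (
        series.astype(str)
        .str.strip()
        .str.replace(r"[$,]", "", regex=True)  # drop currency symbol and separators
        .str.strip()
    )
    # Assumption: a bare dash (from '$-') means zero, as in accounting formats.
    cleaned = cleaned.replace({"-": "0"})
    # Anything that still cannot be parsed becomes NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up sample mimicking padded headers and currency strings
sample = pd.DataFrame({
    " Units Sold ": [" $2,155.00 ", " $546.00 "],
    " Discounts ": [" $7,542.50 ", " $-   "],
})
sample.columns = sample.columns.str.strip()  # remove padding around header names
for col in sample.columns:
    sample[col] = parse_currency(sample[col])

print(sample.dtypes)
```

Running a conversion like this before the missing-value step would let `select_dtypes(include=np.number)` pick up the converted columns.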

In [6]:
# *1. Handle Missing Values* (fill NaN with the column mean, numeric columns only)
for col in df.select_dtypes(include=np.number).columns:
    # Assign the result back: df[col].fillna(..., inplace=True) operates on a copy
    # and will stop working in pandas 3.0
    df[col] = df[col].fillna(df[col].mean())

In [7]:
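# *2. Handle Outliers* (Z-score method)
from scipy.stats import zscore

df_numeric = df.select_dtypes(include=[np.number])  # numeric columns only
z_scores = np.abs(zscore(df_numeric))  # absolute z-score of each value, per column
# Keep rows whose numeric values all lie within 3 standard deviations of the column mean;
# .copy() avoids a possible SettingWithCopyWarning when columns are assigned later
df = df[(z_scores < 3).all(axis=1)].copy()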
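
To make the filter concrete, here is a minimal sketch of the same z-score rule written with pandas alone and run on made-up data. The helper `drop_zscore_outliers` is hypothetical. Unlike the cell above, it skips constant columns, whose z-scores would be NaN and would otherwise cause every row to be dropped.

```python
import numpy as np
import pandas as pd


def drop_zscore_outliers(frame: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows where any numeric value has |z| >= threshold (hypothetical helper)."""
    numeric = frame.select_dtypes(include=np.number)
    std = numeric.std(ddof=0)      # population std, matching scipy's zscore default
    usable = std[std > 0].index    # constant columns carry no outlier information
    z = (numeric[usable] - numeric[usable].mean()) / std[usable]
    keep = (z.abs() < threshold).all(axis=1)
    return frame.loc[keep].copy()


# Made-up data: nineteen ordinary values and one extreme value
toy = pd.DataFrame({
    "value": [10.0] * 19 + [1000.0],
    "constant": [5.0] * 20,
    "label": ["a"] * 20,
})
filtered = drop_zscore_outliers(toy)
print(len(toy), "->", len(filtered))
```

Note that with very few rows a single extreme value may never reach |z| = 3, so the threshold may need adjusting for small datasets.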

In [8]:
# *3. Normalize/Scale Features*
scaler = StandardScaler()
# Collect the numeric columns to scale
features_to_scale = df.select_dtypes(include=np.number).columns.tolist()
# Exclude the target column if it is numeric (the name check is case-sensitive)
if 'Target' in features_to_scale:
    features_to_scale.remove('Target')

# Standardize the selected features to zero mean and unit variance
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])
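
The cell above fits the scaler on the full dataset, before the split in step 4. A common alternative, sketched here on made-up data, is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the scaling. The column names and values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the real dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(50, 10, size=100),
    "feature_b": rng.normal(0, 1, size=100),
    "target": rng.integers(0, 2, size=100),
})

X = toy.drop(columns=["target"])
y = toy["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
# Reuse the training statistics for the test rows
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

The test features will generally not have exactly zero mean after this transform, which is expected.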

In [12]:
# *4. Split Data into Training & Testing Sets*
# Lowercase the column names so the target lookup is case-insensitive
df.columns = df.columns.str.lower()

# If there is no 'target' column, fall back to the first column whose name contains 'segment'
if 'target' not in df.columns:
    potential_target_columns = [col for col in df.columns if 'segment' in col.lower()]
    if potential_target_columns:
        df.rename(columns={potential_target_columns[0]: 'target'}, inplace=True)
    else:
        raise KeyError("Target column not found in the DataFrame. Please check your data.")

# Every column other than 'target' is treated as a feature
feature_columns = df.drop(columns=['target']).columns.tolist()

X = df[feature_columns]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
# *Check Processed Data*
print("Training Features:\n", X_train.head())
print("Testing Features:\n", X_test.head())
print("Training Labels:\n", y_train.head())
print("Testing Labels:\n", y_test.head())

Training Features:
                       country    product   discount band   units sold   \
82                     France      Paseo             Low    $2,155.00    
51   United States of America   Amarilla            None    $1,143.00    
220                    Mexico   Amarilla          Medium    $1,683.00    
669                    Mexico    Montana            High      $546.00    
545                    France    Montana            High    $1,186.00    

     manufacturing price   sale price     gross sales     discounts   \
82                $10.00      $350.00    $7,54,250.00     $7,542.50    
51               $260.00        $7.00       $8,001.00          $-      
220              $260.00        $7.00      $11,781.00       $589.05    
669                $5.00      $300.00    $1,63,800.00    $24,570.00    
545                $5.00      $300.00    $3,55,800.00    $42,696.00    

             sales            cogs          profit         date  month number  \
82    $7,46,707.50   