<a href="https://colab.research.google.com/github/hargurjeet/Adhoc-Activities/blob/main/Data_Preprocessing_Using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DATA PREPROCESSING USING PYTHON**

This notebook collects commonly used preprocessing techniques for data analysis and data science work: handling missing values and outliers, encoding categorical data, and scaling features.

In [1]:
## Data imports
import pandas as pd

## Data manipulation
import numpy as np

## Imputing and preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer
from scipy import stats

## Building visuals
import seaborn as sns
import matplotlib.pyplot as plt

### **Importing Data**

In [None]:
titanic_df = pd.read_csv("https://raw.githubusercontent.com/hargurjeet/Adhoc-Activities/main/Titanic_train.csv")
titanic_df.head()

## **Handling Missing Values and Outliers**

In [None]:
titanic_df.info()

## Key Observation
- The Age and Cabin columns seem to have missing data.
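
`info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. A minimal sketch of counting them directly, assuming a small made-up frame (`toy_df` and its values are illustrative only, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing values per column, worst first
missing = toy_df.isnull().sum().sort_values(ascending=False)

# Share of missing values per column, in percent
missing_pct = (toy_df.isnull().mean() * 100).round(1)

print(missing)
print(missing_pct)
```

The same two lines should work on `titanic_df` once it is loaded.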

### **Mean Imputation**

In [None]:
# Impute missing values with the column mean
imputer = SimpleImputer(strategy='mean')  # strategy can also be 'median' or 'most_frequent', depending on the requirement
titanic_df[['Age']] = imputer.fit_transform(titanic_df[['Age']])

# Verify the changes
print(titanic_df['Age'].isnull().sum())
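
The choice of strategy matters when a column is skewed, because the mean is pulled towards extreme values while the median is not. A small sketch on made-up ages (the `ages` array is hypothetical) shows the difference:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages: one extreme value and one missing entry
ages = np.array([[20.0], [22.0], [24.0], [80.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(ages)
median_filled = SimpleImputer(strategy="median").fit_transform(ages)

# The last row was missing, so it now holds the imputed value
print(mean_filled[-1, 0])    # 36.5, the mean of 20, 22, 24, 80
print(median_filled[-1, 0])  # 23.0, the median of 20, 22, 24, 80
```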

### **Imputation Using ML Algo**

In [None]:
titanic_df = pd.read_csv("https://raw.githubusercontent.com/hargurjeet/Adhoc-Activities/main/Titanic_train.csv")
titanic_df.info()

In [None]:
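# Numeric columns used to predict 'Age'
# (assumes the standard Titanic columns 'Pclass', 'SibSp' and 'Parch' are present)
num_cols = ['Age', 'Pclass', 'SibSp', 'Parch', 'Fare']

# Initialize the IterativeImputer
imputer = IterativeImputer(random_state=0)

# Fit and transform the data; with 'Age' alone the imputer would fall back to the column mean
imputed = imputer.fit_transform(titanic_df[num_cols])

# Replace the original 'Age' column with the imputed values
titanic_df['Age'] = imputed[:, 0]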

# Check if there are any missing values left
print(titanic_df['Age'].isnull().sum())

### **Outlier Identification - Visual**

In [None]:
# Boxplot of 'Fare' to spot outliers visually
sns.set(rc={'figure.figsize': (8, 6)})  # set the figure size before plotting so it applies to this figure
sns.set_style("whitegrid")
sns.boxplot(x=titanic_df['Fare'])
plt.xlabel('Fare')
plt.title('Boxplot of Fare column in Titanic dataset')
plt.show()

### Outlier Identification - IQR

In [None]:
# Calculate the IQR
q1 = titanic_df['Fare'].quantile(0.25)
q3 = titanic_df['Fare'].quantile(0.75)
iqr = q3 - q1

# Calculate the upper and lower bounds
lower_bound = q1 - 1.5*iqr
upper_bound = q3 + 1.5*iqr

# print(lower_bound, upper_bound)
# Keep only the rows whose 'Fare' lies within the bounds (this drops the outliers)
titanic_df = titanic_df[(titanic_df['Fare'] >= lower_bound) & (titanic_df['Fare'] <= upper_bound)]

# Boxplot of 'Fare' after removing the outliers
sns.set(rc={'figure.figsize': (8, 6)})  # set the figure size before plotting so it applies to this figure
sns.set_style("whitegrid")
sns.boxplot(x=titanic_df['Fare'])
plt.xlabel('Fare')
plt.title('Boxplot of Fare column in Titanic dataset')
plt.show()
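
The IQR rule can also be wrapped in a small helper so that outliers are inspected before any rows are dropped. The sketch below is one possible way to do this; `iqr_bounds` and the `fares` values are hypothetical names and data introduced for illustration:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical fares with one obvious outlier
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 500.0])

lower, upper = iqr_bounds(fares)

# Flag the outliers instead of removing them straight away
outlier_mask = (fares < lower) | (fares > upper)
print(lower, upper)
print(fares[outlier_mask].tolist())
```

Flagging first makes it easy to check how many rows would be lost before filtering.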

### Outlier Identification - Z Score

In [None]:
titanic_df = pd.read_csv("https://raw.githubusercontent.com/hargurjeet/Adhoc-Activities/main/Titanic_train.csv")

z_scores = stats.zscore(titanic_df['Fare'])
abs_z_scores = np.abs(z_scores)

# Set the z-score threshold to 3
threshold = 3
outlier_indices = np.where(abs_z_scores > threshold)[0]
# np.where returns positions, so select them with .iloc
outlier_values = titanic_df['Fare'].iloc[outlier_indices]

print("Number of outliers:", len(outlier_values))
print("Outlier values:", outlier_values)
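
For reference, the z-score of a value is its distance from the mean measured in standard deviations. The sketch below, on a made-up array (`values` is hypothetical), checks that a manual calculation matches `stats.zscore`, which uses the population standard deviation (`ddof=0`) by default:

```python
import numpy as np
from scipy import stats

# Hypothetical values with one extreme entry
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0,
                   10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

# Manual z-score: (x - mean) / std, where np.std defaults to ddof=0
manual_z = (values - values.mean()) / values.std()
scipy_z = stats.zscore(values)

# Values more than 3 standard deviations from the mean are flagged
threshold = 3
flagged = values[np.abs(scipy_z) > threshold]
print(np.allclose(manual_z, scipy_z))
print(flagged)
```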

## **Handling Categorical Data**

### Mode Imputation

In [None]:
## Impute missing values in the 'Embarked' column with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
titanic_df[['Embarked']] = imputer.fit_transform(titanic_df[['Embarked']])

# Verify the changes
print(titanic_df[['Embarked']].isnull().sum())

### KNN Imputation, Missing Category and Label Encoding

In [None]:
titanic_df = pd.read_csv("https://raw.githubusercontent.com/hargurjeet/Adhoc-Activities/main/Titanic_train.csv")

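# Count the missing 'Embarked' values
print(titanic_df[['Embarked']].isnull().sum())

# Missing-category approach: give missing ports their own 'Unknown' label
embarked_unknown = titanic_df['Embarked'].fillna('Unknown')
print(embarked_unknown.value_counts())

# Label-encode 'Sex' so it can serve as a numeric feature for KNN
sex_encoded = LabelEncoder().fit_transform(titanic_df['Sex'])

# Label-encode the known 'Embarked' values, keeping NaN where the port is missing
le = LabelEncoder()
known = titanic_df['Embarked'].notna()
embarked_encoded = pd.Series(np.nan, index=titanic_df.index)
embarked_encoded[known] = le.fit_transform(titanic_df.loc[known, 'Embarked'])

# Feature matrix for KNN (assumes the standard 'Pclass' column is present)
knn_features = pd.DataFrame({
    'Sex': sex_encoded,
    'Pclass': titanic_df['Pclass'],
    'Fare': titanic_df['Fare'],
    'Embarked': embarked_encoded,
})

# Define the KNN imputer with k=5
imputer = KNNImputer(n_neighbors=5)

# Impute the missing 'Embarked' codes from the 5 nearest passengers
titanic_imputed = imputer.fit_transform(knn_features)

# KNNImputer averages the neighbours' codes, which is only a rough approximation
# for nominal categories; round back to a valid label code before storing
titanic_df['Embarked'] = np.round(titanic_imputed[:, -1])
print(titanic_df[['Embarked']].isnull().sum())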

## **Categorical Data - Encoding Techniques**

### One Hot Encoding

In [None]:
# Create an instance of the OneHotEncoder class
# (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# One hot encode 'Sex' and 'Embarked' columns
encoded_cols = encoder.fit_transform(titanic_df[['Sex', 'Embarked']])

# Convert the encoded columns to a dataframe (sharing the original index) and append to the original dataframe
encoded_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(['Sex', 'Embarked']), index=titanic_df.index)
titanic_df = pd.concat([titanic_df, encoded_df], axis=1)

In [None]:
encoded_df

## **Feature Scaling**

### Min Max scaler

In [None]:
titanic_df = pd.read_csv("https://raw.githubusercontent.com/hargurjeet/Adhoc-Activities/main/Titanic_train.csv")
titanic_df.describe()

In [None]:
# select numerical columns
num_cols = ['Age', 'Fare']

# create MinMaxScaler object (rescales each column to the [0, 1] range)
scaler = MinMaxScaler()

# fit and transform the data
titanic_df[num_cols] = scaler.fit_transform(titanic_df[num_cols])

# print the scaled data
print(titanic_df[num_cols].describe())
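
`StandardScaler` is imported at the top of the notebook but not demonstrated. As a rough comparison on made-up numbers (the `fares` array is hypothetical), min-max scaling maps values to the [0, 1] range, while standardization centres them on 0 with unit variance:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-column data
fares = np.array([[0.0], [10.0], [20.0], [30.0], [40.0]])

# Min-max scaling: (x - min) / (max - min)
minmax = MinMaxScaler().fit_transform(fares)

# Standardization: (x - mean) / std
standard = StandardScaler().fit_transform(fares)

print(minmax.ravel())
print(standard.ravel())
```

Min-max scaling is sensitive to outliers because the minimum and maximum define the range, so standardization may be the safer choice for a column such as `Fare`.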

### Log transformations

In [None]:
titanic_df = pd.read_csv("https://raw.githubusercontent.com/hargurjeet/Adhoc-Activities/main/Titanic_train.csv")

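# Apply a log transformation to the "Fare" column
# log1p computes log(1 + x), which stays finite for fares of 0 (np.log would give -inf)
titanic_df_log = np.log1p(titanic_df['Fare'])

# Histogram of the original fares
sns.histplot(data=titanic_df, x='Fare')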

In [None]:
sns.histplot(x=titanic_df_log)
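
A fare of 0 is a valid value, but log(0) is undefined, so a plain logarithm produces `-inf` for such rows. The sketch below, on made-up fares (the `fares` array is hypothetical), contrasts `np.log` with `np.log1p` and shows that `np.expm1` reverses the latter:

```python
import numpy as np

# Hypothetical fares, including a zero
fares = np.array([0.0, 7.25, 71.28, 512.33])

# Plain log: the zero fare becomes -inf (the warning is silenced here on purpose)
with np.errstate(divide="ignore"):
    plain_log = np.log(fares)

# log1p computes log(1 + x), so the zero fare maps to 0
safe_log = np.log1p(fares)

# expm1 computes exp(x) - 1 and recovers the original values
recovered = np.expm1(safe_log)

print(plain_log)
print(safe_log)
print(recovered)
```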

### Power transformations

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer

# Generate some skewed data
data = np.random.exponential(scale=2, size=1000)

# Visualize the data distribution
plt.hist(data, bins=50)
plt.title("Original data distribution")
plt.show()

# Apply power transformation
pt = PowerTransformer(method='yeo-johnson')
data_pt = pt.fit_transform(data.reshape(-1, 1))

# Visualize the transformed data distribution
plt.hist(data_pt, bins=50)
plt.title("Power-transformed data distribution")
plt.show()
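
Histograms give a visual impression. Skewness gives a number to compare before and after the transformation. The sketch below assumes a fixed seed for reproducibility and uses `scipy.stats.skew`. The seed and variable names are illustrative choices, not part of the original cell:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Generate right-skewed data with a fixed seed so the run is repeatable
rng = np.random.default_rng(0)
data = rng.exponential(scale=2, size=1000)

# Yeo-Johnson transformation; standardize=True (the default) also centres and scales the output
pt = PowerTransformer(method="yeo-johnson")
data_pt = pt.fit_transform(data.reshape(-1, 1)).ravel()

# Skewness close to 0 indicates a roughly symmetric distribution
skew_before = skew(data)
skew_after = skew(data_pt)
print("Skewness before:", skew_before)
print("Skewness after:", skew_after)
print("Fitted lambda:", pt.lambdas_[0])
```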


## **Reference Material**

- pandas read methods - https://realpython.com/pandas-read-write-files/
- scikit-learn SimpleImputer - https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
- scikit-learn preprocessing - https://scikit-learn.org/stable/modules/preprocessing.html