# **Module 1: Machine Learning**
## **Introduction**

## Data Preprocessing

Data preprocessing is a crucial step in the machine learning pipeline: it prepares and transforms raw data into a format suitable for building and training models. Effective preprocessing ensures that the data is clean, relevant, and ready for analysis, which generally leads to better model performance. Typical tasks include handling missing data, normalization, standardization, and encoding categorical variables.

### Steps in Data Preprocessing:
**1. Data Collection:**
- Gathering raw data from various sources, such as databases, files, APIs, and sensors.

In [None]:
import pandas as pd

# Load dataset
#data_url = 'https://example.com/dataset.csv'
#df = pd.read_csv(data_url)

data = {'feature1': [1, 4, 7],
        'feature2': [2, 5, 8],
        'feature3': [3, 6, 9],
        'target': [1, 1, 0]}
df = pd.DataFrame(data)
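
Besides building a DataFrame by hand, raw data often arrives as a file or a database table. The sketch below is a minimal illustration of both cases, assuming an in-memory CSV string and an in-memory SQLite database as stand-ins for real sources; the table name `samples` and the variables `csv_text`, `df_csv`, and `df_sql` are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a file on disk or a download.
csv_text = "feature1,feature2,feature3,target\n1,2,3,1\n4,5,6,1\n7,8,9,0\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite database standing in for a real database.
conn = sqlite3.connect(":memory:")
df_csv.to_sql("samples", conn, index=False)
df_sql = pd.read_sql_query("SELECT * FROM samples WHERE target = 1", conn)
conn.close()

print(df_csv.shape)
print(df_sql)
```

The same `pd.read_csv` call accepts a file path or URL in place of the `StringIO` object, which is what the commented-out lines in the cell above would do.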


**2. Data Cleaning:**
- Handling Missing Values: Techniques include imputation (mean, median, mode, or using algorithms), or removing rows/columns with missing values.
- Removing Duplicates: Identifying and removing duplicate entries to avoid bias.
- Handling Outliers: Detecting and managing outliers, either by removal or transformation.

In [None]:
# Handling missing values
df.fillna(df.mean(numeric_only=True), inplace=True)  # Impute numeric columns with their mean

# Removing duplicates
df.drop_duplicates(inplace=True)

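# Handling outliers (example using the z-score method)
import numpy as np
from scipy import stats

# Absolute z-score of every numeric value, computed column by column
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))
# Keep only the rows in which every numeric value lies within 3 standard deviations
df = df[(z_scores < 3).all(axis=1)]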
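
The cell above uses mean imputation and z-score filtering. As a complement, here is a minimal sketch of median and mode imputation, plus the interquartile-range (IQR) rule with both ways of managing outliers mentioned in the list: removal and transformation (here, clipping). The DataFrame `raw` and its `age` and `city` columns are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps and one extreme value, for illustration only.
raw = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 40.0, 200.0],
    "city": ["Paris", "Lima", "Lima", None, "Lima", "Paris"],
})

cleaned = raw.copy()
# Median imputation is less sensitive to extreme values than the mean.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
# Mode imputation fills a categorical gap with the most frequent label.
cleaned["city"] = cleaned["city"].fillna(cleaned["city"].mode()[0])

# IQR rule: values more than 1.5 * IQR beyond the quartiles count as outliers.
q1, q3 = cleaned["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1 (removal): drop the rows that fall outside the bounds.
removed = cleaned[cleaned["age"].between(lower, upper)]
# Option 2 (transformation): cap the values at the bounds instead.
clipped = cleaned.assign(age=cleaned["age"].clip(lower, upper))

print(len(removed), clipped["age"].max())
```

Removal discards the whole row, including its valid `city` value, while clipping keeps the row at the cost of altering the extreme value; which is preferable depends on the dataset.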


**3. Data Integration:**
- Combining data from different sources into a single cohesive dataset. This may involve merging tables, joining data, and ensuring consistency.

In [None]:
# A second dataframe, df2, to merge with df

data2 = {'feature1': [1, 4, 9],
         'feature2': [2, 5, 8],
         'feature3': [3, 6, 7],
         'target': [1, 1, 0]}
df2 = pd.DataFrame(data2)

# Merge on the common key 'feature1' (an inner join by default).
# The other column names exist in both dataframes, so they receive
# the default suffixes '_x' (from df) and '_y' (from df2).
mergedData = pd.merge(df, df2, on='feature1')
print(mergedData)
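
To illustrate the "ensuring consistency" part of integration, the following sketch uses an outer join together with the `validate` and `indicator` options of `pd.merge`. The `customers` and `orders` tables and their columns are hypothetical; they are not part of the dataset used elsewhere in this module.

```python
import pandas as pd

# Hypothetical tables sharing an "id" key; names are illustrative only.
customers = pd.DataFrame({"id": [1, 2, 3], "country": ["PE", "FR", "JP"]})
orders = pd.DataFrame({"id": [1, 2, 4], "amount": [10.0, 25.5, 7.0]})

# An outer join keeps unmatched rows from both sides.
# validate="one_to_one" raises an error if a key is duplicated on either side;
# indicator=True adds a "_merge" column recording where each row came from.
combined = pd.merge(customers, orders, on="id", how="outer",
                    validate="one_to_one", indicator=True)
print(combined)

# Consistency check: list the keys that appear in only one of the tables.
unmatched = combined[combined["_merge"] != "both"]
print(sorted(unmatched["id"].tolist()))
```

An inner join would silently drop ids 3 and 4, so inspecting the `_merge` column before choosing a join type can reveal gaps between sources.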


**4. Data Transformation:**
- Normalization/Standardization: Scaling features to a standard range, such as [0, 1] or to have mean 0 and variance 1.
- Encoding Categorical Variables: Converting categorical data into numerical format using techniques like one-hot encoding, label encoding, or binary encoding.
- Feature Engineering: Creating new features from existing data to enhance model performance.
- Dimensionality Reduction: Reducing the number of features using techniques like PCA, LDA, or t-SNE.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
numerical_cols = X.select_dtypes(include=[np.number]).columns

# Create transformers for preprocessing
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Apply transformations
X_processed = preprocessor.fit_transform(X)
print(X_processed)
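
The sample dataset has no categorical columns, so the one-hot branch above does nothing, and the cell does not cover scaling to [0, 1], feature engineering, or dimensionality reduction. The sketch below touches each of those on a small hypothetical DataFrame; the column names (`height_cm`, `weight_kg`, `color`) and the derived `bmi` feature are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical data with one categorical column, for illustration only.
demo = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 72.0, 90.0],
    "color": ["red", "blue", "red", "green"],
})

# Feature engineering: derive a new column from two existing ones.
demo["bmi"] = demo["weight_kg"] / (demo["height_cm"] / 100) ** 2

numeric = demo[["height_cm", "weight_kg", "bmi"]]

# Min-max normalization maps each column onto the range [0, 1].
scaled = MinMaxScaler().fit_transform(numeric)

# One-hot encoding creates one indicator column per category.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(demo[["color"]]).toarray()

# PCA projects the three scaled numeric features onto two components.
reduced = PCA(n_components=2).fit_transform(scaled)

print(scaled.min(axis=0), scaled.max(axis=0))
print(encoder.categories_[0], encoded.shape)
print(reduced.shape)
```

`StandardScaler` and `MinMaxScaler` are interchangeable in this position; the former centers each column at mean 0 with unit variance, the latter bounds it to a fixed range.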


**5. Data Splitting:**
- Dividing the dataset into training, validation, and test sets. Common choices are a 70-20-10 split for training, validation, and testing, or an 80-20 split for training and testing only.

In [None]:
from sklearn.model_selection import train_test_split

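# Split the data into training and testing sets.
# Note: X_processed was scaled using statistics from all rows, so the test rows
# influenced the scaler; the combined pipeline later in this section fits the
# scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)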
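
`train_test_split` produces two parts per call, so a three-way 70-20-10 split takes two calls. The sketch below is one way to do it, assuming hypothetical random data (`X_demo`, `y_demo`) with 100 samples; absolute sample counts are passed as `test_size` to avoid rounding surprises with fractions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, binary labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

n = len(X_demo)
n_test = round(0.10 * n)  # 10% held out for the final test
n_val = round(0.20 * n)   # 20% held out for validation

# First call: separate the test set from everything else.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=n_test, random_state=42)

# Second call: split the remainder into training and validation sets.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=42)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 70 20 10
```

The validation set is used for tuning and model selection, while the test set is touched only once, for the final estimate.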

These steps transform raw data into a format that improves the performance and accuracy of machine learning models, and libraries such as Pandas and Scikit-learn handle them efficiently in Python. The following code combines all the steps into a single cohesive pipeline. Note that the dataset is built by hand rather than loaded from a URL, so that the program runs as-is:

In [None]:
import numpy as np  # needed below for np.number
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
#data_url = 'https://example.com/dataset.csv'
#df = pd.read_csv(data_url)

data = {'feature1': [1, 4, 7],
        'feature2': [2, 5, 8],
        'feature3': [3, 6, 9],
        'target': [1, 1, 0]}
df = pd.DataFrame(data)

# Data cleaning
df.fillna(df.mean(numeric_only=True), inplace=True)
df.drop_duplicates(inplace=True)

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
numerical_cols = X.select_dtypes(include=[np.number]).columns

# Create transformers for preprocessing
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create a pipeline that includes preprocessing and model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
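
With only three rows, the test set above holds a single sample, and neither imputer nor the one-hot encoder has anything to do. The following sketch runs the same kind of pipeline on a larger synthetic dataset so that every step is exercised. All of the data is hypothetical: the `income` and `region` columns, the rule that defines `target`, and the positions of the missing values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: 200 rows, one numeric and one categorical column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.normal(50, 10, size=200),
    "region": rng.choice(["north", "south", "east"], size=200),
})
demo["target"] = (demo["income"] > 50).astype(int)

# Remove some values so the imputers have something to fill.
demo.loc[demo.index[::10], "income"] = np.nan
demo.loc[demo.index[5::25], "region"] = np.nan

X_demo = demo.drop(columns="target")
y_demo = demo["target"]

numeric_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
demo_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_steps, ["income"]),
    ("cat", categorical_steps, ["region"]),
])
demo_model = Pipeline(steps=[
    ("preprocessor", demo_preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Stratify so both classes appear in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Imputation and scaling statistics are learned from the training rows only.
demo_model.fit(X_tr, y_tr)
demo_accuracy = demo_model.score(X_te, y_te)
print(f"Held-out accuracy: {demo_accuracy:.2f}")
```

Because the preprocessing lives inside the pipeline, calling `demo_model.predict` on new rows applies the same imputation, scaling, and encoding automatically.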