# Lab 2: Data, data, data!

As we discussed in our lecture, data can make or break a Machine Learning project. In this lab, we will walk through some fundamental steps to take (or at least check whether they are needed) before we begin model training.

## 1. Data Cleaning
We first need to address issues such as missing values, inconsistencies, and outliers. This ensures that the data is clean, consistent, and ready for analysis or modeling.

Throughout this tutorial, we'll be using a dataset about fantasy world creatures. This dataset includes fictional creatures, their magical power levels, habitats, and more. Our goal is to prepare this dataset to answer the question, **can we predict if they have magic?** First, we will perform some exploratory data analysis (EDA) to understand the data better and visualize interesting aspects.

The dataset contains the following columns:
- `Creature`: Name of the creature.
- `Power_Level`: Magical power level of the creature (scale from 1 to 100).
- `Age`: Age of the creature in years.
- `Wingspan`: Wingspan of the creature in meters.
- `Has_Magic`: Whether the creature has magical abilities (1 for yes, 0 for no).

```python
# Import libraries we will need
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Create a DataFrame with synthetic data
data = {
    'Creature': ['Dragon', 'Dragon', 'Dragon', 'Unicorn', 'Unicorn', 'Phoenix', 'Phoenix', 'Goblin', 'Goblin', 'Elf',
                 'Elf', 'Troll', 'Troll', 'Griffin', 'Griffin', 'Hobbit', 'Hobbit', 'Giant', 'Giant', 'Sphinx', 'Sphinx'],
    'Power_Level': [95, 93, 98, 85, 87, 90, None, 50, 55, 70, 72, 40, 45, 60, 65, 55, None, 80, None, 75, 80],
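    # Assumption: the last Sphinx's Age and Wingspan are treated as missing (None) so every column has 21 entries
    'Age': [300, 310, 290, 5000, 155, 100, None, 50, 55, 200, 210, 75, 80, 120, 125, None, 2000, 500, None, 2000, None],
    'Wingspan': [12.5, 13.0, None, None, None, 15.0, 16.0, 2.0, 50.0, None, 2.2, None, 3.0, 7.0, 8.0, 1.5, 1.8, None, 10.0, 11.0, None],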
    'Has_Magic': [1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
}

# create a DataFrame
df = pd.DataFrame(data)

# view the first 5 rows
df.head() # or df.tail() to view the last 5 rows
```
#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Create a new code cell and create this dataset for yourself to use (don't forget to import the pandas package and make the data into a DataFrame).

### 1.1. Exploratory Data Analysis

#### Basic Info

Let's see some general information about this data. Here are some easy ways to get a sense of what you are working with:

```python
# Get basic information about the dataset
df.info()

# Summary statistics of the dataset
df.describe(include='all')
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Use these functions to understand some basics about the data.

#### Getting number of observations per value
We can see how many observations (rows) have each value within a particular variable using the `value_counts()` method.

```python
value_counts = df['column_name'].value_counts()
print(value_counts)
```
#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Explore each variable. Do you notice any large repeats?

#### Distribution of a variable
Here is a simple way to visualize the distribution of a variable. Throughout this lab, `AAA`, `BBB`, `CCC`, and `YYY` are placeholders for column names of your choice.

```python
# Create a histogram of the AAA distribution (replace 'AAA' with a column name)
plt.figure(figsize=(8, 6))
plt.hist(df['AAA'].dropna(), bins=10, color='purple', edgecolor='black')

# add some aesthetics
plt.title('Distribution of AAA')
plt.xlabel('AAA')
plt.ylabel('Frequency')

# show the plot
plt.show()
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Pick a variable (or try it for a few) that is a `float` and plot its distribution. Is it normal?

#### Using scatter plots
Let's see if there is a relationship between three variables (`AAA`, `BBB`, `CCC`) by visualizing the data as a scatterplot.

```python
# set plot configurations
plt.figure(figsize=(10, 6))
colors = {0: 'red', 1: 'green'}

# plot the data
plt.scatter(df['AAA'], df['BBB'], c=df['CCC'].map(colors), alpha=0.7)

# add aesthetics
plt.title('AAA vs. BBB')
plt.xlabel('AAA')
plt.ylabel('BBB')
# legend colors must match the colors used for the points above
plt.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colors[0], markersize=10, label='CCC-0'),
                    plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=colors[1], markersize=10, label='CCC-1')],
           loc='best')
plt.grid(True)

# show the image
plt.show()
```
#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Pick three variables and test this method out. Remember that the variable you choose for CCC must be categorical!

#### Using bar plots
Calculate the average of a numerical variable for each value of a categorical variable, then plot the averages as bars.

```python
# group by BBB (categorical) and get the mean AAA (continuous) for each
avg_AAA_by_BBB = df.groupby('BBB')['AAA'].mean()

# create a bar plot
plt.figure(figsize=(10, 6))
avg_AAA_by_BBB.plot(kind='bar')

# add aesthetics
plt.title('Average AAA by BBB')
plt.xlabel('BBB')
plt.ylabel('AAA')

# show the plot
plt.show()
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Are there two particular variables that you think this would be useful for? See if there are any trends.

### 1.2. Handling Missing Values

Now that we've explored our data a bit, we can begin to make changes to it. We'll start with missing values. There are two options: Drop or Impute.

**Drop Missing Values**

One option is to remove rows or columns that contain missing values. This is useful when the amount of missing data is small.

```python
# Drop rows with missing values
df.dropna(inplace=True)
```
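
Before dropping anything, it helps to measure how much data is actually missing. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the lab dataset):

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Power_Level': [95, None, 50, 70],
    'Wingspan': [12.5, None, None, 2.2],
})

# Number and share of missing values in each column
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()
print(missing_counts)
print(missing_share)

# How many rows would survive a full dropna()?
rows_kept = len(toy.dropna())
print(f"{rows_kept} of {len(toy)} rows have no missing values")
```
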
**Impute Missing Values**

Filling in missing values with the mean (`.mean()`), median (`.median()`), or mode (`.mode()`) is a common approach. More sophisticated imputation methods (e.g., kNN, MICE) can also be used, but we will keep things simple here. Imputation is helpful when the dataset is small, or when removing rows with missing values would cause imbalances.

```python
# Calculate the mean of a column
column_mean = df['AAA'].mean()

# Fill missing values in the column with the mean
df.fillna({'AAA':column_mean}, inplace=True)
```
Note:
- `inplace=True`: This modifies the original DataFrame df. If you don’t want to modify the original DataFrame, you can set `inplace=False` (or omit the parameter) and assign the result to a new DataFrame.
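
The choice between mean and median matters when a column contains extreme values. As a rough sketch, assuming a hypothetical `ages` Series (values invented for illustration), the two fill values can differ a lot:

```python
import pandas as pd

# Hypothetical ages with one extreme value and one missing value
ages = pd.Series([50, 55, 75, 80, 5000, None], name='Age')

# The mean is pulled upward by the extreme value; the median is not
print("Mean used for imputation:", ages.mean())
print("Median used for imputation:", ages.median())

# Fill the missing entry each way (returns new Series, original unchanged)
mean_fill = ages.fillna(ages.mean())
median_fill = ages.fillna(ages.median())
```

If the filled-in value should look like a typical observation, the median is often the safer pick for skewed columns.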

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Take care of all the missing values in this dataset. Be sure that you can justify your choices!

### 1.3. Removing Duplicates
Check for and remove duplicate rows to ensure that each entry in your dataset is unique. Duplicates can bias a model by giving repeated observations extra weight.

```python
# Remove duplicate rows
df.drop_duplicates(inplace=True)
```
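
To see how many rows are affected, count them before and after. A small sketch on a hypothetical frame (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical toy frame: the second row repeats the first exactly
toy = pd.DataFrame({
    'Creature': ['Elf', 'Elf', 'Troll', 'Elf'],
    'Age': [200, 200, 75, 210],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = toy.duplicated().sum()

rows_before = len(toy)
deduped = toy.drop_duplicates()
rows_removed = rows_before - len(deduped)

print("Duplicate rows found:", n_duplicates)
print("Rows removed:", rows_removed)
```
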
#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Drop duplicates and check how many were removed!

### 1.4. Correcting Data Types
Ensure each column has the appropriate data type. This can be critical for accurate analysis and modeling.

```python
# Convert data types to int, float, and categorical
df['AAA'] = df['AAA'].astype(int)
df['AAA'] = df['AAA'].astype(float)
df['AAA'] = df['AAA'].astype('category')
```
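
One caveat: converting a column to plain `int` raises an error while it still contains missing values. The sketch below, on a hypothetical frame, checks the types first and uses pandas' nullable `'Int64'` type as one possible workaround:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Has_Magic': [1, 0, 1],
    'Age': [300.0, None, 50.0],
    'Creature': ['Dragon', 'Goblin', 'Elf'],
})

# Inspect the current data types
print(toy.dtypes)

# Labels and names are categories, not quantities
toy['Has_Magic'] = toy['Has_Magic'].astype('category')
toy['Creature'] = toy['Creature'].astype('category')

# Age still has a missing value, so astype(int) would fail;
# 'Int64' (capital I) is the nullable integer type and keeps the gap as <NA>
toy['Age'] = toy['Age'].astype('Int64')

print(toy.dtypes)
```
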
#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Make sure all data types are correct (Hint: There was a method above that showed you the data type of each variable). If they aren't correct, go ahead and fix them.

### 1.5. Handling Outliers
Outliers can skew the data distribution and affect model performance. Common methods to identify and handle outliers include using Z-scores and the Interquartile Range (IQR).

**Z-Score Method**

```python
from scipy import stats

# Calculate Z-scores
df['z_score'] = stats.zscore(df['AAA'])

# Define a threshold for Z-scores (3 is standard)
threshold = 3

# Identify outliers
df['is_outlier'] = np.abs(df['z_score']) > threshold

# Drop outliers
df_cleaned = df[~df['is_outlier']].drop(columns=['z_score', 'is_outlier'])
```
**IQR Method**

```python
# Calculate quartiles
Q1 = df['AAA'].quantile(0.25)
Q3 = df['AAA'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define the outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
df['is_outlier'] = (df['AAA'] < lower_bound) | (df['AAA'] > upper_bound)
df.head()

# Drop outliers
df_cleaned = df[(df['AAA'] >= lower_bound) & (df['AAA'] <= upper_bound)]
```
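
The two methods do not always agree, especially on small samples, where a single extreme value inflates the standard deviation and keeps its own Z-score low. The sketch below uses a hypothetical `wingspans` Series (values chosen for illustration) and computes the Z-score by hand with the population standard deviation, which is also the default in `scipy.stats.zscore`:

```python
import pandas as pd

# Hypothetical wingspans with one suspicious value (50.0)
wingspans = pd.Series([2.0, 2.2, 3.0, 7.0, 8.0, 10.0, 11.0, 12.5, 13.0, 50.0])

# Z-score method (ddof=0 means population standard deviation)
z_scores = (wingspans - wingspans.mean()) / wingspans.std(ddof=0)
z_outliers = z_scores.abs() > 3

# IQR method
q1, q3 = wingspans.quantile(0.25), wingspans.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (wingspans < q1 - 1.5 * iqr) | (wingspans > q3 + 1.5 * iqr)

print("Flagged by Z-score:", wingspans[z_outliers].tolist())
print("Flagged by IQR:", wingspans[iqr_outliers].tolist())
```

Here the Z-score rule flags nothing while the IQR rule flags 50.0, which is one reason to look at both before deciding how to handle a value.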

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Identify and handle outliers. Be sure that you can justify why you used the methods you did!

### 1.6. Encoding Categorical Variables
Convert categorical data into numerical format using techniques like one-hot encoding.

```python
# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=['AAA', 'BBB'])
print(df_encoded)
```
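
To make the result concrete, here is a small sketch on a hypothetical frame. Each category becomes its own 0/1 column; the `dtype=int` argument is an optional choice made here so the output shows 0/1 rather than booleans:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Creature': ['Dragon', 'Elf', 'Dragon'],
    'Age': [300, 200, 310],
})

# One new column per category value; the original column is dropped
encoded = pd.get_dummies(toy, columns=['Creature'], dtype=int)
print(encoded)
```

The new column names are built from the original column name and the category value, for example `Creature_Dragon`.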

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Encode the `Creature` variable.

## 2. Feature Engineering for Machine Learning
Feature engineering is a crucial step in the machine learning pipeline that transforms raw data into meaningful features, significantly impacting model performance. This lesson covers key aspects of feature engineering, including feature extraction, scaling, normalization, and selection techniques.

### 2.1. Feature Extraction
Feature extraction involves transforming raw data into a set of features that can be used for machine learning models. This process helps in capturing relevant information and can lead to improved model performance.

#### Common Feature Extraction Techniques

***Note***: The below techniques are used for both classification and regression tasks.

**Principal Component Analysis (PCA)** is a dimensionality reduction technique that transforms data into a set of linearly uncorrelated components, maximizing the variance. Choosing 2 for the number of components is a common practice when you want to visualize high-dimensional data in a 2D plot. It allows you to plot the data points on a 2D plane, which helps in understanding the structure and distribution of the data. Reducing to 2 dimensions is often used for exploratory data analysis and visualization.

```python
from sklearn.decomposition import PCA

# PCA needs numeric input with no missing values
numeric_df = df.select_dtypes(include='number').dropna()

# Initialize PCA
pca = PCA(n_components=2)  # Reducing to 2 dimensions
df_pca = pca.fit_transform(numeric_df)

# see how much variance is explained
print("Explained variance ratio:", pca.explained_variance_ratio_)

# inspect the components to understand which original features are most influential in the new component space
print("Principal components:\n", pca.components_)
```
This will result in an entirely new set of features. Since we are early on in the class, and we want to keep some interpretability to our features, we will skip this for right now. But remember that this code is here when you need it in the future!
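
For future reference, PCA is sensitive to feature scales, so features are commonly standardized before it is applied. The following is a minimal sketch of that workflow on scikit-learn's built-in Iris data, chosen here only because it is fully numeric and has no missing values:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four numeric measurements for 150 flowers
X = load_iris().data

# Standardize first so no feature dominates because of its units
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Shape after PCA:", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance kept:", pca.explained_variance_ratio_.sum())
```

The total variance kept tells you how much information the 2D picture retains compared with the original features.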

**Polynomial Features** creates new features by considering polynomial combinations of existing features, capturing interactions between them.

```python
from sklearn.preprocessing import PolynomialFeatures

# Columns to apply polynomial features
columns_to_transform = ['AAA', 'BBB']

# Separate the features
features_to_transform = df[columns_to_transform]
features_not_to_transform = df.drop(columns=columns_to_transform)

# Initialize PolynomialFeatures
poly = PolynomialFeatures(degree=2)  # add quadratic features

# Fit and transform the specified features
poly_features = poly.fit_transform(features_to_transform)

# Create DataFrame for the polynomial features
poly_features_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(columns_to_transform), index=df.index)

# Combine polynomial features with non-transformed features
df_transformed = pd.concat([poly_features_df, features_not_to_transform], axis=1)
```

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Create one or more polynomial features from our current `df`. Be sure that you can justify your choice.

**Feature Engineering from Dates**

Extract meaningful components from date-time variables such as year, month, day, and weekday.

```python
# Sample data (note: there are other date-time formats, but this is most common)
datedf = pd.DataFrame({'date': pd.to_datetime(['2023-01-01', '2023-02-01', '2023-03-01'])})

# Extract features
datedf['year'] = datedf['date'].dt.year
datedf['month'] = datedf['date'].dt.month
datedf['day'] = datedf['date'].dt.day
datedf['weekday'] = datedf['date'].dt.weekday

print(datedf.head())
```

We don't have a date variable in our current `df`, but keep this in mind for Assignment #1!

### 2.2. Feature Scaling
Feature scaling puts features on comparable scales so that no feature dominates model training simply because of its units. It helps improve the performance of algorithms that are sensitive to feature scales. Both techniques below apply to classification and regression tasks.

#### Key Techniques

**Standardization (Z-score Normalization)**

Transforms features to have a mean of 0 and a standard deviation of 1.

```python
from sklearn.preprocessing import StandardScaler

# columns to scale
columns_to_scale = ['AAA', 'BBB']

# separate the features
features_to_scale = df[columns_to_scale]
features_not_to_scale = df.drop(columns=columns_to_scale)

# initialize and apply StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features_to_scale)

# convert scaled features back to DataFrame
scaled_features_df = pd.DataFrame(scaled_features, columns=columns_to_scale, index=df.index)

# Combine scaled and non-scaled features
df_scaled = pd.concat([scaled_features_df, features_not_to_scale], axis=1)

print(df_scaled)
```
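
As a sanity check, the scaler's output can be compared with the formula `(x - mean) / std` computed by hand. This sketch uses a hypothetical single-column frame; note that `StandardScaler` uses the population standard deviation, so the manual version passes `ddof=0`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, used only for illustration
ages = pd.DataFrame({'Age': [50.0, 100.0, 200.0, 300.0]})

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(ages)

# Scale by hand with the same formula
manual = (ages['Age'] - ages['Age'].mean()) / ages['Age'].std(ddof=0)

print("Matches manual formula:", np.allclose(scaled[:, 0], manual))
print("Mean after scaling:", scaled.mean())
print("Std after scaling:", scaled.std())
```
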

**Min-Max Scaling (Normalization)**

Scales features to a fixed range, usually [0, 1].

*Same process as above with slight changes:*

```python
from sklearn.preprocessing import MinMaxScaler

# pick and separate the features here

# Initialize and apply MinMaxScaler
scaler = MinMaxScaler()
normalized_features = scaler.fit_transform(features)

# continue by adding the normalized features back into the dataset
```
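
A filled-in version of that outline might look like the sketch below. The `toy` frame and its values are hypothetical, and the last lines compare the result with the formula `(x - min) / (max - min)`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Creature': ['Goblin', 'Phoenix', 'Elf', 'Dragon'],
    'Age': [50, 100, 200, 300],
    'Wingspan': [2.0, 15.0, 2.2, 12.5],
})

# Pick and separate the features
columns_to_scale = ['Age', 'Wingspan']
features_not_to_scale = toy.drop(columns=columns_to_scale)

# Initialize and apply MinMaxScaler
scaler = MinMaxScaler()
normalized = pd.DataFrame(
    scaler.fit_transform(toy[columns_to_scale]),
    columns=columns_to_scale,
    index=toy.index,
)

# Add the normalized features back into the dataset
toy_normalized = pd.concat([normalized, features_not_to_scale], axis=1)
print(toy_normalized)

# Same result by hand for one column
manual_age = (toy['Age'] - toy['Age'].min()) / (toy['Age'].max() - toy['Age'].min())
print(np.allclose(normalized['Age'], manual_age))
```
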

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Compute both Standardization and Min-Max Scaling on a subset of features in the data.

**When to Use Different Scaling Techniques**

***Standardization***: Use when features have different units or scales, and you need to normalize for algorithms sensitive to feature magnitudes.

***Min-Max Scaling***: Use when features need to be bounded within a specific range, especially for algorithms assuming features are within a fixed range.

### 2.3. Feature Normalization
Normalization adjusts features to fit a specific range or achieve a certain norm, often useful for algorithms relying on distance metrics.

#### Techniques

**L2 Normalization (Vector Normalization)**

Rescales each row (sample) so that the sum of its squared feature values is 1. Because the norm is computed across a row, the result depends on every feature included in the transformation.

```python
from sklearn.preprocessing import Normalizer

columns_to_normalize = ['AAA', 'BBB']
features_to_normalize = df[columns_to_normalize]

# Initialize and apply Normalizer
normalizer = Normalizer(norm='l2')
normalized_features = normalizer.fit_transform(features_to_normalize)

# convert scaled features back to DataFrame
scaled_features_df = pd.DataFrame(normalized_features, columns=columns_to_normalize, index=df.index)

print(scaled_features_df)

# add those normalized features back to the dataset
```

**L1 Normalization**

Scales features so that the sum of absolute values of the features is 1. Same code process as above except change the norm:

```python
# Initialize and apply Normalizer
normalizer = Normalizer(norm='l1')
```
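
The difference between the two norms is easiest to see on a tiny array. In this sketch (the array `X` is made up for illustration), each row is rescaled on its own:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical samples (rows) with two features each
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# L2: squared values in each row sum to 1
l2 = Normalizer(norm='l2').fit_transform(X)

# L1: absolute values in each row sum to 1
l1 = Normalizer(norm='l1').fit_transform(X)

print("L2:\n", l2)
print("L1:\n", l1)
print("Row sums of squares (L2):", (l2 ** 2).sum(axis=1))
print("Row sums of absolute values (L1):", np.abs(l1).sum(axis=1))
```

The row `[3, 4]` becomes `[0.6, 0.8]` under L2 and `[3/7, 4/7]` under L1.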

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Which method is most appropriate? Apply it!

**Summary of Differences**

***Feature Scaling***: Adjusts the range or distribution of feature values. Techniques include standardization, min-max scaling, robust scaling, and maxabs scaling.

***Feature Normalization***: Typically adjusts features to fit a specific range or achieve a certain norm. Techniques include min-max normalization, L1 normalization, and L2 normalization.

### 2.4. Feature Selection
Feature selection involves choosing the most relevant features to improve model performance, reduce overfitting, and decrease training time.

#### Filter Methods
Evaluate the relevance of features based on statistical tests.

**Chi-Square Test** is used for classification tasks and assesses associations between *categorical variables*.

```python
from sklearn.feature_selection import SelectKBest, chi2

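# Sample data (categorical features; add further features as needed)
X = pd.DataFrame({'feature1': ['tall', 'short', 'short', 'short'],
                  'feature2': ['blue', 'green', 'red', 'blue']})
y = [0, 1, 0, 1]

# chi2 needs non-negative numeric input, so one-hot encode the categories first
X_encoded = pd.get_dummies(X, dtype=int)

# Initialize and apply SelectKBest
selector = SelectKBest(score_func=chi2, k=1)  # Select top 1 feature
X_new = selector.fit_transform(X_encoded, y)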

print("Selected feature indices:", selector.get_support(indices=True))
```

**ANOVA F-test** is used to evaluate the relevance of each feature. Features with high F-values are considered more relevant for predicting the target variable. This is used for *classification* and most commonly for *continuous features* (it can be used for categorical if they are encoded).

```python
from sklearn.feature_selection import f_classif

# Apply ANOVA F-value (for a single feature)
feature = df[['AAA']]
y = df['YYY']

F_value, p_value = f_classif(feature, y)
print(F_value, p_value)

# To apply to all features at once, pass a DataFrame of features (without the outcome column) and the outcome separately
F_values, p_values = f_classif(features_dataframe, y)
```
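
A runnable version on a small hypothetical frame is sketched below. The values are invented so that `Power_Level` separates the two classes clearly and `Age` does not:

```python
import pandas as pd
from sklearn.feature_selection import f_classif

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Power_Level': [95, 90, 85, 50, 45, 40],
    'Age': [300, 50, 200, 310, 55, 210],
    'Has_Magic': [1, 1, 1, 0, 0, 0],
})

# Split the features from the outcome variable
X = toy[['Power_Level', 'Age']]
y = toy['Has_Magic']

# One F-value and one p-value per feature
F_values, p_values = f_classif(X, y)
results = pd.DataFrame({'F': F_values, 'p': p_values}, index=X.columns)
print(results)
```

A high F-value with a small p-value suggests the feature's mean differs between the classes, so `Power_Level` ranks well above `Age` in this toy example.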

**Correlation Coefficient** is a method used to identify which features have the strongest linear relationships with the target variable. This is used for *regression* and *continuous features*.

```python
# Compute the correlation matrix (numeric columns only; text columns cannot be correlated)
correlation_matrix = df.corr(numeric_only=True)

# Get correlation of each feature with the target variable
target_correlation = correlation_matrix['YYY'].drop('YYY')  # Drop the target itself

# Print correlation coefficients
print("Correlation coefficients with the target variable:")
print(target_correlation)

# Select features with high correlation coefficients
# For example, selecting features with correlation coefficient greater than 0.5 or less than -0.5
selected_features = target_correlation[abs(target_correlation) > 0.5].index
```
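
The same steps on a small hypothetical frame are sketched below. Here `Power_Level` plays the role of a continuous target, and the values are invented so that one feature is strongly correlated with it and the other is not:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Creature': ['Elf', 'Troll', 'Giant', 'Sphinx', 'Dragon'],
    'Age': [1, 2, 3, 4, 5],
    'Wingspan': [3, 1, 4, 1, 3],
    'Power_Level': [10, 20, 30, 40, 50],
})

# Correlation matrix over the numeric columns only
correlation_matrix = toy.corr(numeric_only=True)

# Correlation of each feature with the target
target_correlation = correlation_matrix['Power_Level'].drop('Power_Level')
print(target_correlation)

# Keep features whose absolute correlation exceeds 0.5
selected_features = target_correlation[target_correlation.abs() > 0.5].index
print("Selected features:", list(selected_features))
```

Keep in mind that this only captures linear relationships, so a feature with a weak correlation may still be useful to a non-linear model.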

#### <font color='red'>**TRY IT**</font> &#x1f9e0;: Which method of these is most appropriate? Go ahead and apply it.

#### Wrapper Methods
Evaluate feature subsets by training a model on each subset and comparing performance. We will not use these much because they require more computational resources, but here is some example code in case you decide to explore them:

**Recursive Feature Elimination (RFE)**

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# sample data
X = pd.DataFrame({'feature1': [1, 2, 3, 4], 'feature2': [5, 6, 7, 8]})
y = [0, 1, 0, 1]

# initialize model and RFE
model = LogisticRegression()
selector = RFE(model, n_features_to_select=1)
X_new = selector.fit_transform(X, y)

print("Selected features:", selector.support_)
```

#### Embedded Methods

Perform feature selection as part of the model training process. We will not go over these in depth, as we need to understand some more about regression and classification first. But here is some example code to familiarize yourself with.

**Lasso Regression (L1 Regularization)**

Regression Example: Predict house prices using the California housing dataset.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Load the California Housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='Price')

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Apply Lasso Regression
lasso = Lasso(alpha=0.1)  # alpha is the regularization parameter
lasso.fit(X_train, y_train)

# Print the coefficients
print("Lasso coefficients:")
print(lasso.coef_)

# Evaluate the model
y_pred = lasso.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

# Feature selection
selected_features = X.columns[lasso.coef_ != 0]
print("Selected features:", selected_features)
```

Classification Example: Predict species of Iris flowers using the Iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='species')

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply Logistic Regression with L1 Regularization
lasso = LogisticRegression(penalty='l1', solver='liblinear')  # liblinear solver supports L1 penalty
lasso.fit(X_train, y_train)

# Print the coefficients
print("Lasso coefficients:")
print(lasso.coef_)

# Evaluate the model
y_pred = lasso.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Feature selection
selected_features = X.columns[(lasso.coef_ != 0).any(axis=0)]
print("Selected features:", selected_features)
```

**Random Forest**

Regression Example: Predict house prices using the California housing dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Load the California Housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='Price')

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Apply Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Print feature importances
print("Feature importances:")
importances = rf.feature_importances_
for feature, importance in zip(X.columns, importances):
    print(f"{feature}: {importance:.4f}")

# Evaluate the model
y_pred = rf.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
```

Classification Example: Predict species of Iris flowers using the Iris dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='species')

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Print feature importances
print("Feature importances:")
importances = rf.feature_importances_
for feature, importance in zip(X.columns, importances):
    print(f"{feature}: {importance}")

# Evaluate the model
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```