In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings 

warnings.filterwarnings('ignore')

## Introduction

Data preprocessing takes place after exploratory data analysis and cleaning.

We preprocess the data to:
- transform the dataset so it's suitable for modeling
- improve model performance

## 1. Importing data

In [None]:
df = pd.read_csv('../data/volunteer_opportunities.csv')

## 2. Inspecting

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
df.describe()

## 3. Remove Missing Data

In [None]:
df.isna().sum()

- *df.dropna()* -> drops every row that contains at least one missing value; suitable when only a few rows are affected
- *df.drop([1,2,3])* -> drops the rows with index labels 1, 2 and 3
- *df.drop('column_name', axis=1)* -> drops a column
- *df.dropna(subset=['column_name'])* -> drops rows where column_name is missing
- *df.dropna(thresh=2)* -> keeps only the rows that have at least 2 non-missing values
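
The options above can be compared on a small example. This is a minimal sketch on a hypothetical toy frame (the `toy` frame and its column names are made up for illustration, not taken from the volunteer dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": ["x", np.nan, "z", "w"],
})

# Drop every row that has at least one missing value
complete_rows = toy.dropna()

# Drop rows only when column "a" is missing
a_present = toy.dropna(subset=["a"])

# Keep rows that have at least 2 non-missing values
at_least_two = toy.dropna(thresh=2)

print(len(complete_rows), len(a_present), len(at_least_two))
```

Note that `thresh` counts non-missing values, so a higher threshold is stricter.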

## 4. Typing

Pandas infers data types, sometimes incorrectly.

Both the *.info()* method and the *.dtypes* attribute show the data type of each column.

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df_toy = pd.DataFrame({'A': ['1.0', '2.0']})
df_toy.info()

In [None]:
df_toy['A'] = df_toy['A'].astype('float')

In [None]:
df_toy.info()

In [None]:
df_toy.dtypes
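
`astype('float')` raises an error if any value cannot be parsed. When a column may contain such values, `pd.to_numeric` with `errors="coerce"` is one possible alternative. A minimal sketch, assuming a hypothetical series `raw` with one unparseable entry:

```python
import pandas as pd

# Hypothetical column where one entry cannot be parsed as a number
raw = pd.Series(["1.0", "2.5", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising
parsed = pd.to_numeric(raw, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

The coerced NaN values can then be handled with the missing-data tools from section 3.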

## 5. Training and test split

Splitting the dataset into training and test sets helps to:
- detect overfitting
- evaluate performance on a holdout set

In [None]:
from sklearn.model_selection import train_test_split

# Drop the target from the features so it cannot leak into X
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='category_desc'), df.category_desc, test_size=0.2, random_state=42)

**Stratified sampling** keeps every class represented in both splits, in roughly the same proportion as in the full dataset. This matters when the target is very imbalanced.

In [None]:
# Drop the target from the features so it cannot leak into X
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='category_desc'), df.category_desc, test_size=0.2, random_state=42, stratify=df.category_desc)
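
To see what stratification does, here is a minimal sketch on a hypothetical imbalanced target (the `toy` frame and its 90/10 class split are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 rows of class "a", 10 of class "b"
toy = pd.DataFrame({"feature": range(100), "label": ["a"] * 90 + ["b"] * 10})

toy_X = toy[["feature"]]
toy_y = toy["label"]

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, both splits keep the 90/10 class ratio
print(ys_train.value_counts(normalize=True))
print(ys_test.value_counts(normalize=True))
```

Without `stratify`, a small test set could by chance end up with very few rows of the minority class, or none.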

## 6. Standardization

**Standardization** is the process of transforming **continuous** data so that features share a comparable scale, typically zero mean and unit variance. It does not by itself make a skewed feature **normally distributed**.

Many sklearn models work best when features are centered and on similar scales. Unscaled or heavily skewed data can bias them.

Standardization is required when:
- we are using a model that operates in a linear space (like KNN, linear regression or KMeans)
- the dataset features have high variance
- features are on different scales (for instance number of bedrooms vs price)

### 6.1. Log Normalization

Log normalization is useful for features with high variance. It applies a logarithmic transformation, here the natural log, whose base is the constant $e$.




In [None]:
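# Toy frame for illustration; the 'values' column is a placeholder name
df_log = pd.DataFrame({'values': [1.0, 10.0, 100.0, 1000.0]})
df_log['logs'] = np.log(df_log['values'])

df_log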

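The log transform captures relative changes rather than absolute ones and compresses large magnitudes. It requires strictly positive inputs, and values between 0 and 1 map to negative numbers.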
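
The effect on variance can be checked directly. A minimal sketch, assuming a hypothetical frame `toy_log` in which one column spans several orders of magnitude:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "wide" spans several orders of magnitude
toy_log = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0],
    "wide": [10.0, 100.0, 1000.0, 10000.0],
})

# Variance before the transform: "wide" dominates
print(toy_log.var())

# The log transform needs strictly positive inputs
toy_log["wide_log"] = np.log(toy_log["wide"])

# Variance after the transform: both columns are on a similar order
print(toy_log[["small", "wide_log"]].var())
```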

### 6.2. Scaling

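Scaling is useful when:
- features are on different scales
- the model has linear characteristics

It centers each feature around 0 and rescales its variance to 1. The shape of the distribution is unchanged, so scaling alone does not turn a skewed feature into a normally distributed one.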
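
Centering around 0 and rescaling to unit variance is conventionally written as the z-score. Here $\mu$ and $\sigma$ are symbols introduced for illustration, denoting the mean and standard deviation of a single feature $x$:

$$
z = \frac{x - \mu}{\sigma}
$$

Each feature is transformed independently with its own $\mu$ and $\sigma$.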


In [None]:
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'col1': [0.1, 0.2, 0.3],
    'col2': [10, 5.2, 8.3],
    'col3': [120, 100.2, 89.3]
})

scaler = StandardScaler()

df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

df_scaled

In [None]:
# Population variance (ddof=0) per column; StandardScaler scales it to 1
df_scaled.var(ddof=0)
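
As a sanity check, the scaler output can be compared with a manual computation. This is a sketch on a hypothetical frame `toy_scale`. It relies on `StandardScaler` using the population standard deviation, which corresponds to `ddof=0` in pandas:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_scale = pd.DataFrame({
    "col1": [0.1, 0.2, 0.3],
    "col2": [10, 5.2, 8.3],
})

# Manual z-score using the population standard deviation (ddof=0)
manual = (toy_scale - toy_scale.mean()) / toy_scale.std(ddof=0)

scaled = StandardScaler().fit_transform(toy_scale)

# Both approaches should agree up to floating point error
print(np.allclose(manual.to_numpy(), scaled))
print(scaled.mean(axis=0).round(6), scaled.var(axis=0).round(6))
```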

### 6.3. Standardized data and modeling

It's important to split the data before preprocessing it. Otherwise there is **data leakage**: statistics computed on the test data (such as the mean and variance used for scaling) influence the training data, so the model has indirectly seen the test set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

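# X (numeric features) and y (target) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
scaler = StandardScaler()

# Fit the scaler on the training data only, then reuse it on the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)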
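
One way to make the split-then-scale order hard to get wrong is to wrap both steps in a pipeline. A minimal sketch, using the bundled iris dataset purely as a stand-in (it is not part of this notebook's data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Iris is used only as a stand-in dataset for this illustration
iris_X, iris_y = load_iris(return_X_y=True)

Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    iris_X, iris_y, stratify=iris_y, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics when scoring the test data
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(Xi_train, yi_train)

accuracy = model.score(Xi_test, yi_test)
print(accuracy)
```

Because the scaler lives inside the pipeline, the same object can also be passed to cross-validation helpers without leaking test folds into the scaling step.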

## 7. Feature Engineering

**Feature engineering** is the creation of new features from existing ones.

It adds information to the dataset that can improve the performance of the model or add insight into relationships between features.

Before doing feature engineering we must understand our data, because the right features depend heavily on the dataset at hand.

Typical feature engineering scenarios are extracting features from free text, or parsing strings containing dates.

## 8. Encoding categorical variables

Sklearn models require numerical input only. If there is any categorical data, it has to be encoded.

### 8.1 Encoding variables with 2 different values

In [None]:
df = pd.DataFrame({
    'user': [1,2,3, 4],
    'subscribed': ['y','y','n', 'y'],
    'fav_color': ['yellow', 'orange', 'orange', 'green']
})

In [None]:
df.subscribed

In [None]:
# Pandas way
df.subscribed.apply(lambda x: 1 if x=='y' else 0)

In [None]:
# SciKit Learn way 

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit_transform(df.subscribed)

### 8.2 One Hot Encoding

Applies when the variable has more than 2 different values.



In [None]:
pd.get_dummies(df.fav_color)
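
`pd.get_dummies` can also encode a column in place within a whole frame. A minimal sketch on a hypothetical frame `toy_colors` that mirrors the example above (the `color` prefix is an arbitrary choice):

```python
import pandas as pd

toy_colors = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "fav_color": ["yellow", "orange", "orange", "green"],
})

# Encode only the named column; the prefix keeps the new columns identifiable
encoded = pd.get_dummies(toy_colors, columns=["fav_color"], prefix="color")

print(encoded.columns.tolist())
```

Each row has exactly one of the `color_*` columns set, and the original `fav_color` column is removed.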

## 9. Engineering Numerical Features

Examples of engineering numerical features, some of which also reduce dimensionality: taking the mean or median of several variables, or extracting the month or week from dates.
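
As an illustration of the first case, several related columns can be collapsed into one average. A minimal sketch on hypothetical running times (the `runs` frame and its values are invented):

```python
import pandas as pd

# Hypothetical running times over several days
runs = pd.DataFrame({
    "name": ["Sue", "Mark", "Sean"],
    "day1": [20.1, 21.2, 22.5],
    "day2": [19.9, 21.4, 22.7],
    "day3": [20.3, 21.0, 22.3],
})

day_columns = ["day1", "day2", "day3"]

# Collapse the three daily columns into a single row-wise average
runs["mean_time"] = runs[day_columns].mean(axis=1)

print(runs[["name", "mean_time"]])
```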


In [None]:
volunteer = pd.read_csv('../data/volunteer_opportunities.csv')
volunteer.head()

In [None]:
volunteer.columns

In [None]:
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])
volunteer["start_date_month"] = volunteer['start_date_converted'].dt.month

In [None]:
print(volunteer[['start_date_converted', 'start_date_month']].head())

## 10. Engineering Text Features

### 10.1. Extraction

A regular expression is a piece of code that identifies patterns in text.


In [None]:
import re

my_string = 'temperature:75.6 F'
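# Raw string so the backslashes reach the regex engine unchanged
temp = re.search(r'\d+\.\d+', my_string)

# .group() returns the matched text, here '75.6'
temp.group()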
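
The same pattern can be applied to a whole column with the pandas string accessor. A minimal sketch, assuming a hypothetical series `readings` of free-text values:

```python
import pandas as pd

# Hypothetical column of free-text readings
readings = pd.Series(["temperature:75.6 F", "temperature:68.0 F", "no reading"])

# str.extract returns the first capture group; rows without a match become NaN
temps = readings.str.extract(r"(\d+\.\d+)", expand=False).astype(float)

print(temps)
```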

If we are working with text it could be helpful to model it in some way.

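**TF/IDF** (Term Frequency/Inverse Document Frequency) vectorizes words based on their importance: a word weighs more the more often it appears in a document and the rarer it is across the whole corpus.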



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

volunteer.summary

In [None]:
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(volunteer.summary)

In [None]:
text_tfidf
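
The weighting is easier to see on a tiny corpus. A minimal sketch with three hypothetical documents (the `corpus` contents are invented, and `TfidfVectorizer` is used with its default settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus
corpus = [
    "volunteers plant trees in the park",
    "volunteers clean the park",
    "tutors help students read",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(corpus)

# One row per document, one column per distinct token
print(weights.shape)

# Within the first document, a word unique to it ("trees") outweighs
# a word shared with another document ("park")
token_index = vectorizer.vocabulary_
first_doc = weights[0].toarray().ravel()
print(first_doc[token_index["trees"]] > first_doc[token_index["park"]])
```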

## 11. Feature Selection 

Feature selection is the step where some features are picked from the feature set to be used for modeling.

It doesn't create new features.

The goal is to improve the model's performance.

There are many ways to perform feature selection, and some are automated.

Typical aims are reducing noise, removing strongly correlated features, and reducing overall variance.

### 11.1. Removing redundant features 

- Remove noisy features (almost the same information)
- Remove highly correlated features. The Pearson correlation coefficient is handy for this.
- Remove duplicated features (often created during feature engineering)
- Remove features with repeated information (for instance latlon and city)
- Remove the original values used to compute an aggregated value

In [None]:
df_corr = pd.DataFrame({
    'user': [-1,2,3, 4],
    'other': [2, 1, 3, -1],
    'another': [25,-24,2,1]
})

df_corr

In [None]:
df_corr.corr()
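
The correlation matrix can drive the removal step. This is one possible sketch, not the only approach. The `toy_corr` frame is hypothetical and the 0.9 threshold is an arbitrary value chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where "b" is almost a copy of "a"
toy_corr = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 3.1, 3.9, 5.2],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy_corr.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# 0.9 is an arbitrary threshold chosen for illustration
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

reduced = toy_corr.drop(columns=to_drop)
print(to_drop)
```

Of each highly correlated pair, this keeps the column that appears first and drops the later one.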

### 11.2. Selecting features using text vectors

In [None]:
hiking = pd.read_json('../data/hiking.json')

from sklearn.feature_extraction.text import TfidfVectorizer

tfid_vec = TfidfVectorizer()

tfid_text = tfid_vec.fit_transform(hiking.Location)

In [None]:
tfid_vec.vocabulary_

In [None]:
tfid_text[3].data

In [None]:
tfid_text[3].indices

In [None]:
vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}
vocab

In [None]:
zipped_row = dict(zip(text_tfidf[3].indices, text_tfidf[3].data))
zipped_row

In [None]:
# Add in the rest of the arguments
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
# Use the vector that matches the vocabulary (both come from tfidf_vec)
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))

In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# Filter the columns in text_tfidf to only those in filtered_words
filtered_text = text_tfidf[:, list(filtered_words)]

## 12. Dimensionality Reduction and PCA

Dimensionality reduction is an unsupervised learning method.

It combines/decomposes a feature space, shrinking the number of features in it.

Dimensionality reduction is a feature extraction method since the data is transformed into new and different features.

PCA stands for principal component analysis. It uses a linear transformation to project features into a space where they are completely uncorrelated.

While the feature space is reduced, the variance is captured in a meaningful way by combining features into components.

In [None]:
from sklearn.decomposition import PCA

df_pca = pd.DataFrame({
    'user': [-1,2,3, 4],
    'other': [2, 1, 3, -1],
    'trololo': [20, 11, -53, 21],
    'another': [25,-24,2,1]
})

pca = PCA()
df_pca = pca.fit_transform(df_pca)
df_pca

In [None]:
pca.components_

In [None]:
pca.explained_variance_ratio_
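
The explained variance ratio can also be used to choose how many components to keep. A minimal sketch on synthetic data (the array `X_synth` and the 0.95 target are assumptions made for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: the third column is nearly a linear mix of the first two
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
third = base[:, 0] + base[:, 1] + rng.normal(scale=0.01, size=200)
X_synth = np.column_stack([base, third])

# A float n_components keeps the fewest components that explain
# at least that share of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_synth)

print(X_reduced.shape)
print(pca_95.explained_variance_ratio_.cumsum())
```

Because the third column carries almost no independent information, two components are enough to pass the 95% mark here.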

PCA does have some caveats, though:
- Its components are difficult to interpret
- It fits best at the very end of the preprocessing journey, because it is hard to do anything else with the components once the data has been transformed
