# Preprocessing for ML in Python
1. Intro to Data Preprocessing
2. Standardizing Data
3. Feature Engineering
4. Selecting features for modeling
5. Putting it all together

# 1. Intro to Data Preprocessing
- after cleaning and EDA
- = prepping data for modeling

Pandas review:
- df.columns
- df.dtypes
- df.describe()
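
A minimal sketch of these review calls on a made-up DataFrame (the column names and values are assumptions for illustration only). It also shows that `dropna(thresh=n)` keeps the rows or columns that have at least `n` non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; columns "A", "B", "C" are for illustration only
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],
    "B": [7, 7, 3, 4],
    "C": ["x", "y", None, "z"],
})

print(df.columns.tolist())
print(df.dtypes)
print(df.describe())

# Keep only rows with no missing values at all
complete_rows = df.dropna()

# Keep only columns with at least 3 non-missing values
# ("A" has a single non-missing value, so it is dropped)
dense_cols = df.dropna(axis=1, thresh=3)
print(dense_cols.columns.tolist())
```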

In [None]:
# Remove missing data

# Drop all rows that contain missing values
df.dropna()

# Drop specific rows (default rows) using index labels
df.drop([1,2,3])

# Drop specific columns
df.drop("A", axis=1)

# Subset based on value
df[df["B"] == 7]
# Subset non-null values
df_subset = df[df['column_name'].notnull()]

# Count NaN in column "B"
df["B"].isnull().sum()

# Filter out NaN in "B" column
df[df["B"].notnull()]

# keep only columns with at least 3 non-NaN values; axis=0 for rows, axis=1 for columns
df.dropna(axis=1, thresh=3)

## 1.1 pandas data types
- object: string/mixed types
- int64: integer
- float64: float
- datetime64 (or timedelta64): datetime
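
As a rough illustration of the dtypes listed above, the sketch below builds a small made-up DataFrame (all names and values are assumptions) and prints the dtype pandas infers for each column, followed by one `astype` conversion.

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    "name": ["a", "b", "c"],          # object (a string dtype in newer pandas)
    "count": [1, 2, 3],               # integer
    "price": [1.5, 2.0, 3.25],        # float64
    "when": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
})

# Subtracting datetimes yields a timedelta column
df["delta"] = df["when"] - df["when"].iloc[0]

# Convert the integer column to float
df["count_float"] = df["count"].astype("float")

print(df.dtypes)
```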

In [None]:
print(df.dtypes)

## 1.2 type conversion

In [None]:
# change column type
df['C'] = df['C'].astype('float')

## 1.3 Training and Test Sets - Stratified sampling
- 100 samples, 80 class 1 and 20 class 2
- Training set: 75 samples, 60 class 1 and 15 class 2
- Test set: 25 samples, 20 class 1 and 5 class 2
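
The 100-sample example above can be reproduced with a short sketch. The arrays here are synthetic stand-ins, and `test_size=0.25` and `random_state=0` are choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples: 80 of class 1 and 20 of class 2
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 80 + [2] * 20)

# stratify=y preserves the 80/20 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

train_counts = np.bincount(y_train)[1:]  # counts for class 1 and class 2
test_counts = np.bincount(y_test)[1:]
print(train_counts, test_counts)
```

With these inputs the training set holds 60 and 15 samples of the two classes, and the test set holds 20 and 5.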


In [None]:
# stratified sampling with sklearn
from sklearn.model_selection import train_test_split

# check stratification of column/feature
y['labels'].value_counts()
# note: stratify parameter
X_train,X_test,y_train,y_test = train_test_split(X, y, stratify=y)
y_train['labels'].value_counts()
y_test['labels'].value_counts()

In [None]:
# another stratified sampling example

# Create a data with all columns except category_desc
volunteer_X = volunteer.drop('category_desc', axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
# stratify arg = target feature
X_train, X_test, y_train, y_test = train_test_split(
    volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())

# 2. Standardizing Data
- preprocessing method to normalize data

2 methods discussed:
- log normalization
- feature scaling

When to standardize
- model in linear space (knn, linear regression, kmeans clustering)
- features have high variance
- features that are continuous and on different scales

## 2.1 Log normalization
- standardizes data with columns that have high variance
- applies log transformation
- advantages:
    - captures relative changes
    - magnitude of change
    - keeps positive values
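
A small sketch of why the log transform helps, using made-up values that span several orders of magnitude. Note that `np.log` needs strictly positive inputs; for data containing zeros, `np.log1p` is a common alternative.

```python
import numpy as np

# Hypothetical positive values spanning several orders of magnitude
values = np.array([1.0, 10.0, 100.0, 1000.0])
logged = np.log(values)

# The variance shrinks dramatically after the transform
print(values.var(), logged.var())

# Equal ratios become equal differences on the log scale,
# which is what "captures relative changes" refers to
step_small = logged[1] - logged[0]   # log(10 / 1)
step_large = logged[3] - logged[2]   # log(1000 / 100)
print(step_small, step_large)
```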

In [None]:
# log normalization in python
import numpy as np

# check dataframe
df
df.var()

# apply log normalization to column 2
df['col2_log'] = np.log(df['col2'])

In [None]:
# example: log normalization
import numpy as np

# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the Proline column again
print(wine['Proline_log'].var())

## 2.2 Scaling data for feature comparison
- scaling transforms each feature so that mean = 0 and variance = 1
- normalization

### scikit-learn has a scale() function, but it is better to use StandardScaler()

### fit_transform()
Using this will fit the method to the data and transform in a single step.
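
To check what `fit_transform()` produces, here is a minimal sketch on a made-up two-column DataFrame (names and values are assumptions). `StandardScaler` uses the population standard deviation, so the variance is exactly 1 only with `ddof=0`; the pandas default `var()` uses `ddof=1` and comes out slightly larger.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({"small": [1.0, 2.0, 3.0, 4.0],
                   "large": [100.0, 300.0, 500.0, 900.0]})

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_scaled.mean())        # approximately 0 for each column
print(df_scaled.var(ddof=0))   # approximately 1 (population variance)
```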

In [None]:
# StandardScaler()
# note: used .fit_transform()
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), 
                         columns=df.columns)
print(df_scaled)
# variance of every scaled column should now be close to 1
print(df_scaled.var())

In [None]:
# example: scale 3 columns to use in a linear model 

#Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[['Ash','Alcalinity of ash', 'Magnesium']]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)


## 2.3 Standardized data and modeling

In [None]:
# example: k-nearest neighbors on unscaled data

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# preprocessing first
X_train, X_test, y_train, y_test = train_test_split(X, y)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

knn.score(X_test, y_test)

In [None]:
# example2: knn on scaled data

# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data.
print(knn.score(X_test, y_test))

# increased accuracy with scaled data
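
The scaled example above fits the scaler on all of X before splitting, which lets test-set statistics influence the transform. A possible variant, sketched here under the assumption that scikit-learn's bundled wine data is an acceptable stand-in for X and y, fits the scaler on the training split only and reuses it on the test split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled wine data stands in for the notes' X and y
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Fit the scaler on the training split only, then reuse it on the test split
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

knn = KNeighborsClassifier()
unscaled_score = knn.fit(X_train, y_train).score(X_test, y_test)
scaled_score = knn.fit(X_train_scaled, y_train).score(X_test_scaled, y_test)
print(unscaled_score, scaled_score)
```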

# 3. Feature Engineering
- Feature engineering = creation of new features based on existing features
- gives insight into relationships b/n features
- extract and expand data
- it is dataset dependent

## 3.1 Encoding categorical variables - binarization
- in pandas: apply() with a lambda function
- in scikit-learn: LabelEncoder

In [None]:
# in pandas
print(users['subscribed'])

users['sub_enc'] = users['subscribed'].apply(
    lambda val:1 if val=='y' else 0)

# look at columns side by side
print(users[['subscribed', 'sub_enc']])

In [None]:
# fit_transform() in scikit-learn
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
users['sub_enc_le'] = le.fit_transform(users['subscribed'])

print(users[['subscribed', 'sub_enc_le']])

# can use this in pipeline

In [None]:
# One-hot encoding - use when a column has more than 2 categories
# use get_dummies()

print(users['fav_color'])
print(pd.get_dummies(users['fav_color']))


In [None]:
# example: binary encoding categorical variables, .fit_transform()

# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())

In [None]:
# example: one-hot encoding for 2+ categories (ie. yes, no, maybe)
# get_dummies()

# Transform the category_desc column
category_enc = pd.get_dummies(volunteer['category_desc'])

# Take a look at the encoded columns
print(category_enc.head())

## 3.2 Engineering numerical features
- aggregate stats, dates, and how to add value to numerical features
- common method of feature engineering: take aggregate of a set of numbers to use in place of features (ie. mean, median)
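
As a sketch of both ideas, the snippet below computes a row-wise mean and extracts a month using vectorized pandas calls instead of `apply()`. The city names, temperatures, and dates are made up for illustration.

```python
import pandas as pd

# Hypothetical temperatures per city plus a date string column
df = pd.DataFrame({
    "city": ["NYC", "LA"],
    "day1": [68.0, 75.0],
    "day2": [70.0, 77.0],
    "day3": [72.0, 79.0],
    "date": ["2018-07-01", "2018-12-25"],
})

columns = ["day1", "day2", "day3"]
# axis=1 averages across the listed columns for each row
df["mean"] = df[columns].mean(axis=1)

# The .dt accessor exposes datetime parts for a whole column at once
df["month"] = pd.to_datetime(df["date"]).dt.month
print(df[["city", "mean", "month"]])
```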

In [None]:
# aggregate stats: create a mean column

# df of cities on rows, columns are temp for each day
print(df)
# make a list of columns
columns = ['day1','day2','day3']
# axis=1 to operate across a row
df['mean'] = df.apply(lambda row: row[columns].mean(), axis=1)
# creates a single mean value column
print(df)

In [None]:
# aggregate stats: dates and timestamps

# collection of dates
print(df)
# extract month by converting to a datetime column
df['date_converted'] = pd.to_datetime(df['date'])
df['month'] = df['date_converted'].apply(lambda row: row.month)
print(df)
print(df[['date','month']].head())

## 3.3 Text classification - text feature engineering
Methods:
- extract part of string or number
- NLP methods
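
When extracting a number from a string, the choice between `re.match` and `re.search` matters: `match` only succeeds at the start of the string, while `search` scans the whole string. A minimal sketch with an example string:

```python
import re

my_string = "temperature: 75.6 F"
pattern = re.compile(r"\d+\.\d+")   # digits, a literal period, digits

at_start = pattern.match(my_string)    # None: the string starts with letters
anywhere = pattern.search(my_string)   # finds "75.6"

temp = float(anywhere.group(0)) if anywhere is not None else None
print(at_start, temp)
```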

In [None]:
# Extract features from strings using regular expressions (regex)
import re
my_string = 'temperature: 75.6 F'

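# \d = a digit, + = one or more
# \. = literal period
pattern = re.compile(r'\d+\.\d+')
# search() scans the whole string; match() only matches at the start and would return None here
temp = re.search(pattern, my_string)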
print(float(temp.group(0)))

In [None]:
# vectorizing text: Tf idf
# Tf idf vectors = term frequency + inverse document frequency

from sklearn.feature_extraction.text import TfidfVectorizer

print(documents.head())

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# Naive Bayes text classifier
# uses conditional probability
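
A self-contained sketch of the tf-idf plus Naive Bayes idea on made-up documents. The texts, labels, and the choice of `MultinomialNB` are assumptions for illustration; these notes do not say which Naive Bayes variant `nb` refers to.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents and labels, for illustration only
documents = ["walk dogs at the shelter", "feed cats at the shelter",
             "tutor math after school", "tutor reading after school"]
labels = ["animals", "animals", "education", "education"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)   # sparse matrix: documents x terms
print(text_tfidf.shape)

# MultinomialNB accepts the sparse tf-idf matrix directly
nb = MultinomialNB()
nb.fit(text_tfidf, labels)

# New text must go through the same fitted vectorizer
prediction = nb.predict(tfidf_vec.transform(["walk cats"]))
print(prediction)
```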


In [None]:
# Example: regex string feature extraction

# Write a pattern to extract numbers and decimals
def return_mileage(length):
    pattern = re.compile(r"\d+\.\d+")
    
    # Search the text for matches
    mile = re.match(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))
        
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking['Length'].apply(lambda row: return_mileage(row))
print(hiking[["Length", "Length_num"]].head())

In [None]:
# Example: tf/idf string vectorization

# Take the title text
title_text = volunteer['title']

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

# Use vectors to predict 'category_desc' column
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
# Note: toarray() method on tf/idf vector for sklearn format
# stratify parameter = y, since the class distribution is uneven
train_X, test_X, train_y, test_y = train_test_split(
    text_tfidf.toarray(), y, stratify=y)

# Fit the model to the training data with Naive Bayes' fit()
# (nb is assumed to be a Naive Bayes classifier created earlier; it is not defined in these notes)
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

# out: 0.47096774193548385

# 4. Feature Selection for modeling
- drop redundant features
- work with text vectors
- use PCA to reduce number of features and decrease overall variance
- iterative process

## 4.1 Removing redundant features
- noisy features
- strongly correlated features
    - linear models assume feature independence
    - use Pearson correlation coefficient
- duplicated features
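
One way to spot strongly correlated features programmatically is sketched below. The data is made up, and the 0.75 cutoff is an assumed threshold that would need tuning per dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "B" is an exact multiple of "A", "C" is only weakly related
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = df.corr().abs()   # Pearson correlation by default
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.75   # assumed cutoff
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(to_drop, axis=1)
print(to_drop)
```

Here only the later column of each correlated pair is flagged, so "A" is kept and "B" is dropped.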

In [None]:
# Check correlated features
df.corr()
# column A and B have a correlation of 1, so 
# prob drop one of the features

In [None]:
# example: drop redundant features

# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# Drop those columns from the dataset
volunteer = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
print(volunteer.head())

In [None]:
# check correlated features

# Print out the column correlations of the wine dataset
print(wine.corr())

# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)

## 4.2 Selecting features using text vectors
- example: you can take something like top 20% of weighted words across the vector
- iterate through different subsets of the tf-idf vector

In [None]:
# pull out words and weights on a per document basis

# view the vocabulary of the vectorizer (fit on location descriptions)
# vocabulary_ maps each word to its column index, e.g. '200': 0,...'ahead': 3
print(tfidf_vec.vocabulary_)

# view 4th row
print(text_tfidf[3].data)
# get indices of words that are weighted
print(text_tfidf[3].indices)
# reverse key value pairs in the vocabulary
vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}
print(vocab)

# zip the row indices and weights and turn into a dictionary
zipped_row = dict(zip(text_tfidf[3].indices, text_tfidf[3].data))
print(zipped_row)



In [None]:
# function: looking at word weights

def return_weights(vocab, vector, vector_index):
    """
    Function performs row zipping to a dictionary.
    Return a dictionary mapping the word to its score.
    args:
        vocab - reversed vocab list
        vector - the vector
        vector_index - the row we want
    """
    zipped = dict(zip(vector[vector_index].indices,
                     vector[vector_index].data))
    return {vocab[i]:zipped[i] for i in vector[vector_index].
           indices}

# pass in reversed vocab list, the text tfidf, and index for 4th row
# Result: mapping of words to scores
print(return_weights(vocab, text_tfidf, 3))

# you can sort by score or eliminate words below a certain threshold


In [None]:
# Explore text vectors - part 1
# Add to return_weights() function and return a list of numbers 
# within the function.

# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, 
                   top_n):
    zipped = dict(zip(vector[vector_index].indices, 
                      vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in 
                               vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[
        :top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
# vector_index=8 grabs the 9th row
# top_n=3 grabs top 3 weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, 
                     text_tfidf, 8, 3))

# out
# [189, 942, 466]

In [None]:
# Explore text vectors - part 2
# Write another function to collect the top words across
# all documents, extract them, return a list of word indices
# and use that list to filter the text_tfidf vector.

def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous exercise, 
        # and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, 
                                vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word 
    # indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_,
text_tfidf, 3)

# By converting filtered_words back to a list, 
# we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

# next, train a model using the filtered vector

### 4.2.a Training Naive Bayes with feature selection
- rerun the Naive Bayes text classification model with selection choices from previous exercise on volunteer dataset's title and category_desc columns.

In [None]:
# Split the dataset according to the class distribution of category_desc, using the filtered_text vector
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), volunteer['category_desc'], stratify=y)

# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

# out
# 0.4838709677419355
# the title field is a very small text field used to demonstrate
# filtering vectors, so accuracy not great here

## 4.3 Dimensionality reduction with PCA
- unsupervised learning method
- combines/decomposes a feature space
- it's a feature extraction method - we'll use it here to reduce the feature space

PCA - principal component analysis
- linear transformation to uncorrelated space
- captures as much variance as possible in each component

PCA caveats
- difficult to interpret components beyond the ones that explain the most variance
- black box method
- end of preprocessing journey - difficult to do further feature work after PCA, except eliminating components that don't explain much variance
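
A sketch of how one might choose the number of components to keep from the cumulative explained variance. It uses scikit-learn's bundled wine data as a stand-in, scales the features first (PCA is variance-driven, so a single large-scale column can otherwise dominate), and treats the 95% cutoff as an assumed threshold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled wine data stands in for the notes' wine DataFrame
X, _ = load_wine(return_X_y=True)

# Put all features on the same scale before PCA
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components explaining at least 95% of the variance
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
X_reduced = X_pca[:, :n_keep]
print(n_keep, cumulative[:n_keep])
```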

In [None]:
# PCA in scikit-learn
from sklearn.decomposition import PCA

pca = PCA()
df_pca = pca.fit_transform(df)

# print out new pca transformed vector 
print(df_pca)
# print explained variance ratio
# See the %age of variance explained by the component
# Look to drop components that don't explain much of the variance
print(pca.explained_variance_ratio_)

In [None]:
# Using PCA
# wine dataset to try and improve model accuracy

from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different 
# components
print(pca.explained_variance_ratio_)

# out
# [9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
#  1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
#  1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
#  8.25392788e-08]

In [None]:
# PCA - train a model using pca transformed vector

# Split the transformed X and the y labels into training and 
# test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(
    transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
knn.score(X_wine_test, y_wine_test)

# out
# 0.6666666666666666

# 5. Putting it all together
- Entire preprocessing workflow
- UFO dataset and preprocessing

Important concepts:
- missing data: dropna() and notnull()
- types: astype()
- stratified sampling: train_test_split(X, y, stratify=y)

## 5.1 Preprocessing

### 5.1.a Check and change column types

In [None]:
# check and change column types

# Check the column types
print(ufo.dtypes)

# Change the type of seconds to float
ufo["seconds"] = ufo['seconds'].astype('float')

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo['date'])

# Check the column types
print(ufo[['seconds','date']].dtypes)

### 5.1.b Drop missing data

In [None]:
# Check how many values are missing in the length_of_time, state, 
# and type columns
print(ufo[['length_of_time', 'state', 'type']].isnull().sum())

# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo["length_of_time"].notnull() & 
          ufo["state"].notnull() & 
          ufo["type"].notnull()]

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

### 5.1.c Categorical variables and standardization
- one-hot encoding with pd.get_dummies()

Standardization: 
- var()
- np.log()

In [None]:
# Extract numbers from UFO['length_of_time'] string with regex

def return_minutes(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"\d+")
    
    # Use match on the pattern and column
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))
        
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(lambda val: 
                                             return_minutes(val))

# Take a look at the head of both of the columns
print(ufo[['length_of_time','minutes']].head())

In [None]:
# identify features for standardization by looking at variance

# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())
# out:
# seconds    424087.417474
# minutes       117.546372
# dtype: float64
# note: variance of seconds is really high

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo['seconds'])

# Print out the variance of just the seconds_log column
print(ufo['seconds_log'].var())
# out
# 1.1223923881183004

## 5.2 Engineering new features
- extract month from date field
- extract digits from length_of_time field
- vectorize text from desc field

Review tips:
- dates: .month or .hour attributes
- regex: \d and .group()
- text: tf-idf and TfidfVectorizer
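
The date and regex tips can be sketched with vectorized pandas calls. The two rows below are made up, with column names mirroring these notes. Unlike `re.match`, `str.extract` finds the first run of digits anywhere in the string, so "about 5 minutes" yields 5 instead of a missing value.

```python
import pandas as pd

# Hypothetical sightings; column names mirror the notes, values are made up
ufo = pd.DataFrame({
    "date": ["2011-11-03 19:21:00", "2004-06-16 23:30:00"],
    "length_of_time": ["2 weeks", "about 5 minutes"],
})

# Dates: convert once, then read parts through the .dt accessor
ufo["date"] = pd.to_datetime(ufo["date"])
ufo["month"] = ufo["date"].dt.month
ufo["hour"] = ufo["date"].dt.hour

# Regex: capture the first run of digits found in each string
ufo["minutes"] = (ufo["length_of_time"]
                  .str.extract(r"(\d+)", expand=False)
                  .astype(float))
print(ufo)
```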

### 5.2.a Encoding categorical variables
- use binary and one-hot encoding methods

In [None]:
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda val:
                                            1 if val=='us' else 0)

# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

### 5.2.b Features from dates
- feature engineering with month and year extraction

In [None]:
# Look at the first 5 rows of the date column
print(ufo['date'].head(5))

# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda x:x.month)

# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda x:x.year)

# Take a look at the head of all three columns
print(ufo[['date','month','year']].head())

### 5.2.c Text vectorization
- transform desc column into tf/idf vectors

In [None]:
# Take a look at the head of the desc field
print(ufo['desc'].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo['desc'])

# Look at the number of columns this creates
print(desc_tfidf.shape)

## 5.3 Feature selection and modeling
- redundant features
- text vector
- iterate preprocessing and modeling

### 5.3.a Selecting the ideal dataset
Let's get rid of some of the unnecessary features. Because we have an encoded country column, country_enc, keep it and drop other columns related to location: city, country, lat, long, state.

We have columns related to month and year, so we don't need the date or recorded columns.

We vectorized desc, so we don't need it anymore. For now we'll keep type.

We'll keep seconds_log and drop seconds and minutes.

Let's also get rid of the length_of_time column, which is unnecessary after extracting minutes.

In [None]:
# Check the correlation between the seconds, seconds_log, and 
# minutes columns
print(ufo[['seconds', 'seconds_log', 'minutes']].corr())

# Make a list of features to drop
to_drop = ['city','country','lat','long','state','date',
'recorded','desc','seconds','minutes','length_of_time']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_,
desc_tfidf, 4)

### 5.3.b Modeling the UFO dataset - Part 1
- build a k-nearest neighbor model to predict which country the UFO sighting took place in
- X dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place
- y labels are the encoded country column, where 1 is us and 0 is ca

In [None]:
# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y,
stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

# out
# 0.8843683083511777

### 5.3.c Modeling the UFO dataset - Part 2
- build a model using the text vector we created, desc_tfidf, using the filtered_words list to create a filtered text vector
- see if we can predict the type of the sighting based on the text. We'll use a Naive Bayes model for this.

In [None]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y 
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit nb to the training sets
nb.fit(train_X, train_y)

# Print the score of nb on the test sets
print(nb.score(test_X, test_y))

# out
# 0.14989293361884368

# since accuracy is poor, need to iterate and figure what
# subset of text improves the model