**What is data preprocessing?**
___
- beyond cleaning and exploratory data analysis
- prepping data for modeling
    - e.g. transforming categorical data to numeric
- Pandas
    - .columns
    - .dtypes
    - .describe()
    - remove missing data
        - .dropna() - drop rows containing NA values (default axis=0; thresh=n keeps only rows with at least n non-NA values)
        - df["B"].isnull().sum() - count of null values in a specific column
        - df[df["B"].notnull()] - keep only the rows where a specific column is not null
    - .drop([1, 2, 3]) - drop specific rows by index label
    - .drop("A", axis=1) - drop a specific column
    - df[df["B"] == 7] - boolean indexing: select rows where a column meets a condition
___
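
A minimal sketch of the pandas calls listed above, run on a small made-up DataFrame. The data is hypothetical; only the column names A and B follow the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame; the values are invented for illustration
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0, np.nan],
    "B": [7.0, 2.0, np.nan, 7.0, np.nan],
    "C": ["x", "y", "z", None, "w"],
})

# Count the missing values in column B
n_missing_b = df["B"].isnull().sum()

# Drop every row that contains at least one missing value
complete_rows = df.dropna()

# Keep only the rows where B is present (boolean indexing)
b_present = df[df["B"].notnull()]

# Keep only the rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

# Drop rows by index label, and drop a column by name
without_rows = df.drop([1, 2])
without_a = df.drop("A", axis=1)

# Select the rows where B equals 7
b_is_seven = df[df["B"] == 7]

print(n_missing_b, complete_rows.shape, b_present.shape)
```

None of these calls modify df in place; each returns a new object, so the result has to be assigned to a name to be kept.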

In [None]:
#Missing data - rows

#Taking a look at the volunteer dataset again, we want to drop rows
#where the category_desc column values are missing. We're going to do
#this using boolean indexing, by checking to see if we have any null
#values, and then filtering the dataset so that we only keep rows
#where category_desc is not null.

# Check how many values are missing in the category_desc column
#print(volunteer["category_desc"].isnull().sum())

# Subset the volunteer dataset
#volunteer_subset = volunteer[volunteer["category_desc"].notnull()]

# Print out the shape of the subset
#print(volunteer_subset.shape)

#################################################
#<script.py> output:
#    48
#    (617, 35)
#################################################
#Remember that you can use boolean indexing to effectively subset
#DataFrames.

**Working with data types**
___
- Why are types important?
    - .dtypes
        - object - string/mixed types
        - int64 - integer
        - float64 - float
        - datetime64 (or timedelta) - datetime
- Converting column types
    - .astype("float")
___
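
A small sketch of a type conversion, assuming a made-up DataFrame whose numeric column was read in as text. The column names are hypothetical. `pd.to_numeric` is shown as one option for values that cannot be parsed; it is not part of the notes above.

```python
import pandas as pd

# Hypothetical data: numbers that arrived as strings
df = pd.DataFrame({"hits": ["737", "22", "62"], "title": ["a", "b", "c"]})
print(df.dtypes)  # hits is not numeric yet (object or a string dtype, depending on the pandas version)

# Convert the text column to integers
df["hits"] = df["hits"].astype("int")

# Numeric operations now behave as expected
total_hits = df["hits"].sum()
print(df["hits"].dtype, total_hits)

# astype raises an error on unparseable values; pd.to_numeric with
# errors="coerce" turns them into NaN instead
messy = pd.Series(["1", "2", "n/a"])
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```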

In [None]:
#Converting a column type

#If you take a look at the volunteer dataset types, you'll see that
#the column hits is type object. But, if you actually look at the
#column, you'll see that it consists of integers. Let's convert that
#column to type int.

# Print the head of the hits column
#print(volunteer["hits"].head())

# Convert the hits column to type int
#volunteer["hits"] = volunteer["hits"].astype("int")

# Look at the dtypes of the dataset
#print(volunteer.dtypes)

#################################################
#<script.py> output:
#    0    737
#    1     22
#    2     62
#    3     14
#    4     31
#
#    Name: hits, dtype: object
#    opportunity_id          int64
#    content_id              int64
#    vol_requests            int64
#    event_time              int64
#    title                  object
#    hits                    int64
#    summary                object
#    is_priority            object
#    category_id           float64
#    category_desc          object
#    amsl                  float64
#    amsl_unit             float64
#    org_title              object
#    org_content_id          int64
#    addresses_count         int64
#    locality               object
#    region                 object
#    postalcode            float64
#    primary_loc           float64
#    display_url            object
#    recurrence_type        object
#    hours                   int64
#    created_date           object
#    last_modified_date     object
#    start_date_date        object
#    end_date_date          object
#    status                 object
#    Latitude              float64
#    Longitude             float64
#    Community Board       float64
#    Community Council     float64
#    Census Tract          float64
#    BIN                   float64
#    BBL                   float64
#    NTA                   float64
#    dtype: object
#################################################

**Class distribution**
___
- How do you split train/test when the class labels are unevenly distributed?
- Stratified sampling
    - check the class distribution of a split with .value_counts()
    - the train_test_split parameter is "stratify="
___

In [None]:
#Stratified sampling

#We know that the distribution of variables in the category_desc
#column in the volunteer dataset is uneven. If we wanted to train a
#model to try to predict category_desc, we would want to train the
#model on a sample of data that is representative of the entire
#dataset. Stratified sampling is a way to achieve this.

# Create a DataFrame with all columns except category_desc
#volunteer_X = volunteer.drop("category_desc", axis=1)

# Create a category_desc labels dataset
#volunteer_y = volunteer[["category_desc"]]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
#X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
#print(y_train["category_desc"].value_counts())

#################################################
#<script.py> output:
#    Strengthening Communities    230
#    Helping Neighbors in Need     89
#    Education                     69
#    Health                        39
#    Environment                   24
#    Emergency Preparedness        11
#    Name: category_desc, dtype: int64
#################################################
#You'll use train_test_split frequently while building models, so
#it's useful to be familiar with the function.
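
A self-contained sketch of the same idea on made-up data, since the volunteer dataset is not included here. The feature and label names are hypothetical, and `random_state` is fixed only to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class "a", 20 rows of class "b"
X = pd.DataFrame({"feature": range(100)})
y = pd.Series(["a"] * 80 + ["b"] * 20, name="label")

# stratify=y keeps the 80/20 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Compare the class counts of the two splits
print(y_train.value_counts())
print(y_test.value_counts())
```

Without `stratify`, a random split could leave a rare class under-represented in one of the two sets, or missing from it entirely.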

**Standardizing data**
___
- many scikit-learn models assume (approximately) normally distributed data
- applied to continuous numerical data
- linearity assumptions
- types discussed
    - log normalization
    - feature scaling
- when to standardize data
    - model in linear space
    - dataset features have high variance
    - dataset features are continuous and on different scales
___

In [None]:
#Modeling without normalizing

#Let's take a look at what might happen to your model's accuracy if
#you try to model data without doing some sort of standardization
#first. Here we have a subset of the wine dataset. One of the
#columns, Proline, has an extremely high variance compared to the
#other columns. This is an example of where a technique like log
#normalization would come in handy, which you'll learn about in
#the next section.

#The scikit-learn model training process should be familiar to you
#at this point, so we won't go too in-depth with it. You already
#have a k-nearest neighbors model available (knn) as well as the X
#and y sets you need to fit and score on.

# Split the dataset and labels into training and test sets
#X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
#knn.fit(X_train, y_train)

# Score the model on the test data
#print(knn.score(X_test, y_test))

#################################################
#<script.py> output:
#    0.5333333333333333
#################################################
#You can see that the accuracy score is pretty low. Let's explore
#methods to improve this score.

**Log normalization**
___
- applies log transformation to feature(s) with high variance relative to other features
- helps feature(s) approach normality
- takes the natural log (base e, where e ≈ 2.718)
    - e.g. log 30 = 3.4, log 300 = 5.7, log 3000 = 8.0
- captures relative changes and the magnitude of change, and keeps values positive (for inputs greater than 1)
___
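
A short sketch of log normalization on invented numbers, assuming one column sits on a much larger scale than the other. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "large" varies far more than "small"
df = pd.DataFrame({
    "small": [1.2, 2.3, 1.9, 2.8, 2.1],
    "large": [30.0, 300.0, 3000.0, 650.0, 1200.0],
})

# Natural log (base e) of the high-variance column
df["large_log"] = np.log(df["large"])

# The variance drops sharply after the transformation
print(df["large"].var(), df["large_log"].var())

# The worked values from the notes above, rounded to one decimal place
print(np.round(np.log([30, 300, 3000]), 1))
```

`np.log` is only defined for positive inputs. For columns that contain zeros, `np.log1p` (the log of 1 + x) is a commonly used alternative.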

In [None]:
#Log normalization in Python
#Now that we know that the Proline column in our wine dataset has a
#large amount of variance, let's log normalize it.

#Numpy has been imported as np in your workspace.

# Print out the variance of the Proline column
#print(wine["Proline"].var())

# Apply the log normalization function to the Proline column
#wine["Proline_log"] = np.log(wine["Proline"])

# Check the variance of the normalized Proline column
#print(wine["Proline_log"].var())

#################################################
#<script.py> output:
#    99166.71735542436
#    0.17231366191842012
#################################################
#The np.log() function is an easy way to log normalize a column.

**Scaling data for feature comparison**
___
- What is feature scaling?
    - features on different scales
    - model with linear characteristics
    - centers each feature at a mean of zero and scales it to unit variance
    - puts features on a comparable scale without changing the shape of their distributions
___
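
A minimal sketch of what standard scaling does, using made-up values on two very different scales. The column names and numbers are hypothetical. The manual calculation at the end is included only to show the formula behind `StandardScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({
    "ash": [2.4, 2.1, 2.7, 2.5, 2.9],
    "magnesium": [127.0, 100.0, 101.0, 113.0, 118.0],
})

# Fit the scaler and transform the data in one step (returns a NumPy array)
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Each column now has a mean of about 0 and a standard deviation of about 1
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))

# The same result by hand: (x - mean) / std, using the population std (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)
print(np.allclose(scaled, manual.to_numpy()))
```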

In [None]:
#Scaling data - standardizing columns

#Since we know that the Ash, Alcalinity of ash, and Magnesium
#columns in the wine dataset are all on different scales, let's
#standardize them in a way that allows for use in a linear model.

# Import StandardScaler from scikit-learn
#from sklearn.preprocessing import StandardScaler

# Create the scaler
#ss = StandardScaler()

# Take a subset of the DataFrame you want to scale
#wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

# Apply the scaler to the DataFrame subset
#wine_subset_scaled = ss.fit_transform(wine_subset)

#################################################
#In scikit-learn, running fit_transform during preprocessing will
#both fit the scaler to the data and transform the data in a
#single step.

**Standardized data and modeling**
___

In [None]:
#KNN on non-scaled data

#Let's first take a look at the accuracy of a K-nearest neighbors
#model on the wine dataset without standardizing the data. The knn
#model and the X (features) and y (labels) sets have already been
#created. Most of this process of creating models in scikit-learn
#should look familiar to you.

# Split the dataset and labels into training and test sets
#X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
#knn.fit(X_train, y_train)

# Score the model on the test data
#print(knn.score(X_test, y_test))

#################################################
#<script.py> output:
#    0.6444444444444445
#################################################
#This scikit-learn workflow should be very familiar to you at this point.

In [None]:
#KNN on scaled data

#The accuracy score on the unscaled wine dataset was decent, but we
#can likely do better if we scale the dataset. The process is mostly
#the same as the previous exercise, with the added step of scaling
#the data. Once again, the knn model and the X (features) and y
#(labels) sets have already been created for you.

# Create the scaling method.
#ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
#X_scaled = ss.fit_transform(X)
#X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
#knn.fit(X_train, y_train)

# Score the model on the test data.
#print(knn.score(X_test, y_test))

#################################################
#<script.py> output:
#    0.9555555555555556
#################################################
#The increase in accuracy is worth the extra step of scaling the dataset.
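
The exercise above fits the scaler on the full dataset before splitting, so the test rows influence the scaling statistics. A common alternative, sketched below, is to split first and fit the scaler on the training rows only. This sketch assumes scikit-learn's bundled wine data (`load_wine`) as a stand-in for the course dataset, creates its own KNN models with default settings, and fixes `random_state` for repeatability, so its scores will not match the outputs shown above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Bundled wine data, used here as a stand-in for the course dataset
X, y = load_wine(return_X_y=True)

# Split first so the test rows stay unseen while the scaler is fitted
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Baseline: KNN on the raw features
knn_raw = KNeighborsClassifier()
knn_raw.fit(X_train, y_train)
raw_score = knn_raw.score(X_test, y_test)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same model type, now trained on the scaled features
knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
scaled_score = knn_scaled.score(X_test_scaled, y_test)

print(raw_score, scaled_score)
```

KNN relies on distances between rows, so a feature with a large numeric range can dominate the result unless the features are scaled first.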

**Feature engineering**
___
- What is feature engineering?
    - the creation of new features based on existing features
    - insight into relationships between features
    - extract and expand data
    - dataset-dependent
- scenarios
    - text data
    - categorical data
    - time stamps
    - averages
___
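
Two of the scenarios above, time stamps and averages, can be sketched on a made-up DataFrame. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical dataset: a date stored as text and several related measurements
df = pd.DataFrame({
    "start_date": ["2011-07-30", "2011-02-01", "2011-01-29"],
    "run1": [20.5, 22.0, 19.8],
    "run2": [21.0, 21.5, 20.2],
    "run3": [20.0, 23.0, 20.0],
})

# Time stamps: parse the text, then extract one component as a new feature
df["start_date"] = pd.to_datetime(df["start_date"])
df["start_month"] = df["start_date"].dt.month

# Averages: collapse several related columns into a single feature per row
df["run_mean"] = df[["run1", "run2", "run3"]].mean(axis=1)

print(df[["start_month", "run_mean"]])
```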

**Encoding categorical variables**
___
- encoding binary values
    - Pandas
        - .apply() plus lambda function
        - pd.get_dummies() for one-hot encoding of variables with two or more labels
    - scikit-learn
        - LabelEncoder
___
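
The three encoding options listed above can be compared side by side on a small made-up DataFrame. The column names and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data with one binary column and one multi-class column
df = pd.DataFrame({
    "accessible": ["Y", "N", "N", "Y"],
    "category": ["Education", "Health", "Education", "Environment"],
})

# Binary encoding with pandas: .apply() plus a lambda function
df["accessible_apply"] = df["accessible"].apply(lambda v: 1 if v == "Y" else 0)

# Binary encoding with scikit-learn: classes are sorted, so N -> 0 and Y -> 1
enc = LabelEncoder()
df["accessible_le"] = enc.fit_transform(df["accessible"])

# One-hot encoding: one indicator column per category
category_enc = pd.get_dummies(df["category"])

print(df[["accessible", "accessible_apply", "accessible_le"]])
print(category_enc)
```

Recent pandas versions return boolean indicator columns from `get_dummies`; passing `dtype=int` gives 0/1 integers instead.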

In [None]:
#Encoding categorical variables - binary

#Take a look at the hiking dataset. There are several columns here
#that need encoding before they can be modeled, one of which is the
#Accessible column. Accessible is a binary feature with two values,
#Y or N, so it needs to be encoded into 1s and 0s. Use scikit-learn's
#LabelEncoder class to do that transformation.

# Set up the LabelEncoder object
#enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
#hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

# Compare the two columns
#print(hiking[["Accessible_enc", "Accessible"]].head())

#################################################
#<script.py> output:
#       Accessible_enc Accessible
#    0               1          Y
#    1               0          N
#    2               0          N
#    3               0          N
#    4               0          N
#################################################
#.fit_transform() is a good way to both fit an encoding and
#transform the data in a single step.

In [None]:
#Encoding categorical variables - one-hot

#One of the columns in the volunteer dataset, category_desc, gives
#category descriptions for the volunteer opportunities listed.
#Because it is a categorical variable with more than two categories,
#we need to use one-hot encoding to transform this column numerically.
#Use Pandas' get_dummies() function to do so.

# Transform the category_desc column
#category_enc = pd.get_dummies(volunteer["category_desc"])

# Take a look at the encoded columns
#print(category_enc.head())

#################################################
#<script.py> output:
#       Education            ...              Strengthening Communities
#    0          0            ...                                      0
#    1          0            ...                                      1
#    2          0            ...                                      1
#    3          0            ...                                      1
#    4          0            ...                                      0
#
#    [5 rows x 6 columns]
#
#################################################
#get_dummies() is a simple and quick way to encode categorical variables.