<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href="#Common-Functions" data-toc-modified-id="Common-Functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Common Functions</a></span><ul class="toc-item"><li><span><a href="#Data-preprocessing" data-toc-modified-id="Data-preprocessing-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Data preprocessing</a></span><ul class="toc-item"><li><span><a href="#Identifying-identifiers" data-toc-modified-id="Identifying-identifiers-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Identifying identifiers</a></span></li><li><span><a href="#Identifying-missing-values" data-toc-modified-id="Identifying-missing-values-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Identifying missing values</a></span></li><li><span><a href="#Identifying-categorical-variables-(features-and-target)" data-toc-modified-id="Identifying-categorical-variables-(features-and-target)-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Identifying categorical variables (features and target)</a></span></li></ul></li></ul></li></ul></div>

<b>
<p>
<center>
<font size="5">
Popular Machine Learning Methods: Idea, Practice and Math
</font>
</center>
</p>
    
<p>
<center>
<font size="4">
Utilities: Shallow Learning
</font>
</center>
</p>

<p>
<center>
<font size="3">
Data Science, Columbian College of Arts & Sciences, George Washington University
</font>
</center>
</p>

<p>
<center>
<font size="3">
Yuxiao Huang
</font>
</center>
</p>
</b>

# Overview

- This notebook collects common functions used throughout Popular Machine Learning Methods: Idea, Practice and Math (PMLM).
- Concretely, these functions are used for:
    - data preprocessing
    - plotting
- See the accompanying slides in our [GitHub repository](https://github.com/yuxiaohuang/teaching/tree/master/gwu/machine_learning_I/fall_2020/slides).

# Common Functions

## Data preprocessing

### Identifying identifiers
The code below shows how to find *identifiers* (features whose value is unique for each sample) in the data.

In [None]:
def id_checker(df, dtype='float'):
    """
    The identifier checker

    Parameters
    ----------
    df : dataframe
    dtype : the data type identifiers cannot have, 'float' by default
            i.e., if a feature has this data type, it cannot be an identifier
    
    Returns
    ----------
    The dataframe of identifiers
    """
    
    # Get the dataframe of identifiers
    df_id = df[[var for var in df.columns
                # If the data type is not dtype
                if (df[var].dtype != dtype
                    # If the value is unique for each sample
                    and df[var].nunique(dropna=True) == df[var].notnull().sum())]]
    
    return df_id
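
To make the condition concrete, the sketch below traces the same two checks on a small, made-up dataframe. The column names and values are hypothetical and chosen only for illustration: `id` is unique per sample, `city` repeats, and `score` is unique but stored as float.

```python
import pandas as pd

# Hypothetical toy data (not from the course datasets)
df_toy = pd.DataFrame({
    'id': [101, 102, 103, 104],         # integer, unique for each sample
    'city': ['DC', 'NYC', 'DC', 'LA'],  # repeated value, so not an identifier
    'score': [0.1, 0.2, 0.3, 0.4],      # unique, but float is excluded by default
})

# Same condition as in id_checker: the data type is not float, and the number
# of distinct non-null values equals the number of non-null entries
id_cols = [var for var in df_toy.columns
           if (df_toy[var].dtype != 'float'
               and df_toy[var].nunique(dropna=True) == df_toy[var].notnull().sum())]

print(id_cols)
```

Only `id` passes both checks. `score` is rejected by the data type test even though its values are unique, which is why continuous features are not mistaken for identifiers.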

### Identifying missing values

The code below shows how to find the variables that contain NaN, along with the proportion of NaN in each and its data type.

In [None]:
import pandas as pd

def nan_checker(df):
    """
    The NaN checker

    Parameters
    ----------
    df : dataframe
    
    Returns
    ----------
    The dataframe of variables with NaN, their proportion of NaN and data type
    """
    
    # Get the dataframe of variables with NaN, their proportion of NaN and data type
    df_nan = pd.DataFrame([[var, df[var].isna().sum() / df.shape[0], df[var].dtype]
                           for var in df.columns if df[var].isna().sum() > 0],
                          columns=['var', 'proportion', 'dtype'])
    
    # Sort df_nan in descending order of the proportion of NaN
    df_nan = df_nan.sort_values(by='proportion', ascending=False).reset_index(drop=True)
    
    return df_nan
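
As a quick sanity check, the same proportions can be computed on a small, made-up dataframe. This sketch uses a vectorized pandas idiom rather than the function above, and the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the course datasets)
df_toy = pd.DataFrame({
    'age': [25.0, np.nan, 31.0, np.nan],  # 2 of 4 values are missing
    'city': ['DC', None, 'NYC', 'LA'],    # 1 of 4 values is missing
    'income': [50, 60, 70, 80],           # no missing values
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the proportion of NaN in each column
nan_prop = df_toy.isna().mean()

# Keep only the columns with at least one NaN, largest proportion first
nan_prop = nan_prop[nan_prop > 0].sort_values(ascending=False)

print(nan_prop)
```

Here `age` comes first with a proportion of 0.5, followed by `city` with 0.25, and `income` is dropped because it has no NaN.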

### Identifying categorical variables (features and target)

The code below shows how to find categorical variables (those whose data type is `dtype`, `'object'` by default) and the number of unique values in each.

In [None]:
def cat_var_checker(df, dtype='object'):
    """
    The categorical variable checker

    Parameters
    ----------
    df : the dataframe
    dtype : the data type categorical variables should have, 'object' by default
            i.e., if a variable has this data type, it should be a categorical variable
    
    Returns
    ----------
    The dataframe of categorical variables and their number of unique values
    """
    
    # Get the dataframe of categorical variables and their number of unique values
    df_cat = pd.DataFrame([[var, df[var].nunique(dropna=False)]
                           # If the data type is dtype
                           for var in df.columns if df[var].dtype == dtype],
                          columns=['var', 'nunique'])
    
    # Sort df_cat in descending order of the number of unique values
    df_cat = df_cat.sort_values(by='nunique', ascending=False).reset_index(drop=True)
    
    return df_cat
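
The sketch below traces the same check on a small, made-up dataframe. The column names and values are hypothetical, and the string columns are given the `object` data type explicitly so that the example does not depend on how pandas infers string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the course datasets)
df_toy = pd.DataFrame({
    'color': pd.Series(['red', 'blue', np.nan, 'red'], dtype='object'),
    'tier': pd.Series(['S', 'M', 'L', 'XL'], dtype='object'),
    'price': [1.5, 2.0, 2.5, 3.0],  # float, so not treated as categorical
})

# Same condition as in cat_var_checker: keep the columns whose data type is
# object, and count their unique values with NaN counted as its own value
cat_counts = {var: df_toy[var].nunique(dropna=False)
              for var in df_toy.columns if df_toy[var].dtype == 'object'}

print(cat_counts)
```

Because `dropna=False` is used, the NaN in `color` counts as a third value next to `red` and `blue`, so `color` has 3 unique values and `tier` has 4.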