# 2.3. Get Your Data Machine Learning Ready for Classification: Data Wrangling

## Preparing the Data

### **Table of Contents**  

- [1. Data Import](#1)  
- [2. Data Wrangling](#2)
  - [2.1 Response Variable Distribution](#21)
  - [2.2 Rare Classes in Features](#22)
  - [2.3 Noninformative features](#23)
- [3. Data Splitting](#3)
  - [3.1 Full Data to Training and Testing](#31)
  - [3.2 Training to Build and Validation](#32)
- [4. Addressing Class Imbalance](#4)

<a id="1"></a>
## 1 Data Import

In Course 1, Module 3: *Magic Pandas Library: Mastering Higher Education Data Preparation and Analysis*, we learned how to merge data that originated from multiple sources across campus. The High School, Enrollment, Admissions, Course and Completion datasets all provide valuable information to assist us in our effort to predict student metrics in future semesters. As you recall, we've selected a subset of the variables from these data to include in the modeling phase. These include:
1. Academic Performance Data

      - Available at time of admission: high school GPAs

      - Available at time of modeling: units attempted, completed and DFW, and available postsecondary GPAs  
2. Demographic Data
      - Gender, ethnicity, first gen status

3. The target variable, **SEM_3_STATUS**, a qualitative variable coded as follows:

| Code | Meaning |
|---|---|
|E |Enrolled |
|N |Not Enrolled |
|G |Graduated |


Let's load the necessary Python libraries to import the data and start to process it for analysis:

In [None]:
import pandas as pd
import warnings

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

Now let's import the **ml_data** data we've curated. Then, by calling `head()` on the name we assign it, we can scope out the top 5 rows of the DataFrame:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
file_path = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/student_academics_data.csv'

df = pd.read_csv(file_path)
display(df.head())


<a id="2"></a>
## 2 Data Wrangling

Data availability is a necessary condition for data analysis, but it is not sufficient. There are a number of modifications we need to make to the data to prepare it for machine learning. The process of preparing the data for exploration and modeling is known as **data wrangling**, and will be performed here.
To answer Shontelle's question, we need to build a model using cohorts for which term 3 grade data has already been collected. Thus our response variable will be based on the SEM_3_STATUS variable. Let's dig deeper.

<a id="21"></a>

#### 2.1 Response Variable Distribution

Recall that this DataFrame consists of five cohorts: Fall 2018 through Fall 2022. The cohort sizes may be identified as follows:

In [None]:
df['COHORT'].value_counts()

| COHORT | count |
|---|---|
| Fall 2022 | 5363 |
| Fall 2019 | 5170 |
| Fall 2018 | 4954 |
| Fall 2020 | 4910 |
| Fall 2021 | 4866 |


The code below groups the retention DataFrame by 'COHORT' and 'SEM_3_STATUS' columns, counts the number of occurrences for each category, and resets the index, renaming the count column to 'COUNTS'.


In [None]:
df[['COHORT', 'SEM_3_STATUS']].groupby(['COHORT', 'SEM_3_STATUS']).size().reset_index(name='COUNTS')

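| | COHORT | SEM_3_STATUS | COUNTS |
|---|---|---|---|
| 0 | Fall 2018 | E | 4307 |
| 1 | Fall 2018 | N | 647 |
| 2 | Fall 2019 | E | 4583 |
| 3 | Fall 2019 | N | 587 |
| 4 | Fall 2020 | E | 4239 |
| 5 | Fall 2020 | N | 671 |
| 6 | Fall 2021 | E | 4133 |
| 7 | Fall 2021 | N | 733 |
| 8 | Fall 2022 | E | 4540 |
| 9 | Fall 2022 | N | 823 |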
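
Raw counts are hard to compare across cohorts of different sizes. As a minimal sketch, using a small made-up DataFrame rather than the course data, the share of each status within a cohort could be computed with `pd.crosstab`. The name `status_share` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data for illustration only; it mimics the COHORT and SEM_3_STATUS columns
toy = pd.DataFrame({
    "COHORT": ["Fall 2018"] * 4 + ["Fall 2019"] * 5,
    "SEM_3_STATUS": ["E", "E", "E", "N", "E", "E", "E", "E", "N"],
})

# normalize="index" turns each row (cohort) into proportions that sum to 1
status_share = pd.crosstab(toy["COHORT"], toy["SEM_3_STATUS"], normalize="index")
print(status_share)
```

Applied to the real data, a table like this makes the imbalance between the E and N classes visible at a glance for every cohort.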


### Data Quality Assurance

We may proceed by investigating data quality issues in our DataFrame that could affect our analysis. These include:

 - Noninformative features
 - Missing values in features
 - Rare classes in features

<a id="23"></a>

#### Noninformative features

##### Drop unnecessary or redundant columns outside of analysis scope

In [None]:
df.drop(['HS_GPA', 'SEM_1_STATUS', 'SEM_2_STATUS'], axis=1, inplace=True)

<a id="4"></a>
### Addressing Missingness

As mentioned previously, an essential data preprocessing step for modeling in scikit-learn is accounting for missingness in our observations. Most scikit-learn estimators raise an error when the input contains missing values, so we need to decide how to deal with them.
Let's investigate missingness in our dataset, and use that to determine the most effective way to proceed:

To check for missing values in a Pandas DataFrame, we can use the command `df.isnull().sum()`. The output of this command shows a large number of missing values in our data. This is expected, as high school data is not available for many students. While missing data can sometimes be ignored during exploratory data analysis, it must be addressed for predictive modeling using libraries like statsmodels and scikit-learn, which require complete data.



In [None]:
display(df.isnull().sum())

| Column | Missing values |
|---|---|
| SID | 0 |
| COHORT | 0 |
| RACE_ETHNICITY | 0 |
| GENDER | 0 |
| FIRST_GEN_STATUS | 0 |
| FAMILY_INCOME | 20967 |
| HS_MATH_GPA | 359 |
| HS_ENGL_GPA | 359 |
| COLLEGE | 0 |
| UNITS_ATTEMPTED_1 | 130 |


Before proceeding, we need to decide how to handle these missing values. We have three main options:

1.  **Exclude variables with excessive missingness:** Remove entire columns whose share of missing values exceeds a chosen threshold.
2.  **Impute values:** Fill in missing values with estimated or plausible values.
3.  **Remove observations:** Delete all rows that contain any missing values.
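
To make the three options concrete, here is a minimal sketch on a small made-up DataFrame. The column names mirror the course data, but the values, the variable names `option_1` to `option_3`, and the choice of the median for imputation are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "FAMILY_INCOME": [np.nan, np.nan, np.nan, 52000.0],
    "HS_MATH_GPA": [3.2, np.nan, 2.8, 3.6],
    "GPA_1": [3.0, 3.5, 2.0, 4.0],
})

# Option 1: drop columns whose share of missing values exceeds a threshold
threshold = 0.5
missing_share = toy.isnull().mean()
option_1 = toy.loc[:, missing_share <= threshold]

# Option 2: impute, here with each numeric column's median
option_2 = toy.fillna(toy.median(numeric_only=True))

# Option 3: drop every row that has at least one missing value
option_3 = toy.dropna()

print(option_1.shape, option_2.isnull().sum().sum(), option_3.shape)
```

Note how aggressive option 3 is on this toy example: only one of four rows survives, which is why it is usually a last resort when missingness is concentrated in a few columns.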



## Addressing Missing Values: Why Imputation Follows Data Splitting

Handling missing data is a key step in preparing your dataset for machine learning. In this section we identify missing values and drop columns with more than 50% missing data. Because this changes the set of columns, it also affects which rows are later flagged as duplicates.

### Impact of Dropping High Missingness Columns

Dropping columns with a high percentage of missing values affects subsequent data cleaning steps, including duplicate detection. If this step were performed *after* splitting the data, the training and testing sets might have different sets of columns dropped based on their individual missingness profiles. This could lead to inconsistencies between the training and testing data, potentially impacting model performance and interpretability. Performing this removal *before* splitting ensures that both the training and testing sets have the same set of features based on the overall data's missingness patterns. This consistent feature set then allows for more reliable duplicate detection based on the remaining, more complete columns across the entire dataset before the split.

### Imputation and Data Leakage

After handling high missingness and duplicates, remaining missing values need imputation. To prevent **data leakage**, it's vital to impute *after* splitting data into training and testing sets.

Data leakage occurs when test set information influences the training process. If imputation values (like means) are calculated using the entire dataset before splitting, information from the test set leaks into the training set.

**Imputing After Splitting:**

*   Calculate imputation values (e.g., mean) using *only* the training data.
*   Fill missing values in the training set using these training-based values.
*   Fill missing values in the testing set using the *same* values calculated from the training set.

This approach ensures the model learns to handle missing data based only on the training set's patterns, accurately reflecting how it would perform on new, unseen data. Imputing before splitting can lead to an overestimation of model performance on the test set.

Therefore, imputing missing values after splitting is crucial for preventing data leakage and getting a realistic measure of your model's ability to generalize.
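
The fit-on-train, apply-to-test pattern described above can be sketched with scikit-learn's `SimpleImputer`. This is an illustrative example on made-up values, not the course's own imputation code; the toy column and the `random_state` value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature with two missing entries (illustration only)
toy = pd.DataFrame(
    {"GPA_1": [2.0, 3.0, np.nan, 4.0, 3.5, np.nan, 2.5, 3.0, 1.0, 3.8]}
)

# Split first, so the test rows cannot influence the imputation value
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(toy_train)  # mean is learned from training rows only
test_filled = imputer.transform(toy_test)        # the same training mean is reused here

print(imputer.statistics_)
```

The key design point is that `fit_transform` is called only on the training rows, while the test rows only ever see `transform`.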

#### ***Exclude variables with excessive missingness***: Identify columns with more than 50% missing values and drop them from the dataframe.



In [None]:
missing_values_count = df.isnull().sum()
total_rows = len(df)
columns_to_drop = missing_values_count[missing_values_count / total_rows > 0.5].index.tolist()
df.drop(columns=columns_to_drop, inplace=True)
display(f"Number of remaining columns: {df.shape[1]}")
display(df.head())

'Number of remaining columns: 18'

Unnamed: 0,SID,COHORT,RACE_ETHNICITY,GENDER,FIRST_GEN_STATUS,HS_MATH_GPA,HS_ENGL_GPA,COLLEGE,UNITS_ATTEMPTED_1,UNITS_ATTEMPTED_2,UNITS_COMPLETED_1,UNITS_COMPLETED_2,DFW_UNITS_1,DFW_UNITS_2,GPA_1,GPA_2,GPA_3,SEM_3_STATUS
0,UHDOP5522,Fall 2020,Asian,Female,Continuing Generation,3.2,3.4,Visual & Performing Arts,15.0,14.0,15.0,15.0,0.0,0.0,4.0,3.785714,4.0,E
1,UHE842CU6,Fall 2021,Black or African American,Female,Continuing Generation,2.6,3.75,Visual & Performing Arts,12.0,12.0,12.0,12.0,3.0,4.0,3.0,2.5,1.5,E
2,UHJFT1JAB,Fall 2018,Asian,Female,Continuing Generation,3.4,3.5,Visual & Performing Arts,15.0,15.0,15.0,16.0,0.0,0.0,3.8,3.6,3.6,E
3,UHKF05TAF,Fall 2018,Hispanic,Female,First Generation,3.0,3.375,Letters & Humanities,16.0,9.0,7.0,3.0,9.0,9.0,1.5625,1.0,2.5,E
4,UHKKQ8UY5,Fall 2021,Hispanic,Male,Continuing Generation,2.5,2.625,Letters & Humanities,13.0,13.0,13.0,13.0,0.0,0.0,3.538462,3.769231,3.4,E


<a id="22"></a>
#### 2.2 Rare Classes in Categorical Features

Let's take a look at the distribution of values in our qualitative variables. If it turns out that there are some values that are rare, they could cause issues with our downstream data processing. One way to avoid this is to consolidate rare classes into one. Note that consolidating or dropping variables is not a reflection of their importance or relevance to the analysis; instead they highlight one of the limitations of machine learning and the importance of human oversight to create a legitimate representation of the truth.

Inspect the unique values and their counts for the categorical columns to identify any anomalies or labels that need fixing.



In [None]:
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
    display(f"Value counts for column: {col}")
    display(df[col].value_counts())

'Value counts for column: SID'

| SID | count |
|---|---|
| P9CHKVJ7X | 2 |
| 0W430472L | 2 |
| FS2AVPW1M | 2 |
| FY9HZPJ60 | 2 |
| MWYLJAF1S | 2 |
| ... | ... |
| Z8UOMJIWF | 1 |
| Z8UOB0XEW | 1 |
| Z8UITDMUC | 1 |
| Z8UF1Z7WR | 1 |


'Value counts for column: COHORT'

| COHORT | count |
|---|---|
| Fall 2022 | 5363 |
| Fall 2019 | 5170 |
| Fall 2018 | 4954 |
| Fall 2020 | 4910 |
| Fall 2021 | 4866 |


'Value counts for column: RACE_ETHNICITY'

| RACE_ETHNICITY | count |
|---|---|
| Hispanic | 12359 |
| Asian | 5995 |
| White | 3481 |
| Two or More Races | 1190 |
| Nonresident alien | 925 |
| Black or African American | 911 |
| Unknown | 318 |
| Native Hawaiian or Other Pacific Islander | 61 |
| American Indian or Alaska Native | 23 |


'Value counts for column: GENDER'

Unnamed: 0_level_0,count
GENDER,Unnamed: 1_level_1
Female,15119
Male,10014
Nonbinary,38
Female,26
female,26
Male,20
male,20


'Value counts for column: FIRST_GEN_STATUS'

| FIRST_GEN_STATUS | count |
|---|---|
| Continuing Generation | 15735 |
| First Generation | 7384 |
| Unknown | 2144 |


'Value counts for column: COLLEGE'

| COLLEGE | count |
|---|---|
| Health & Human Services | 4581 |
| Engineering & Technology | 4051 |
| General Studies | 3952 |
| Letters & Humanities | 3707 |
| Natural and Mathematical Sciences | 2868 |
| Business Administration | 2840 |
| Visual & Performing Arts | 2752 |
| Education & Leadership | 512 |

'Value counts for column: SEM_3_STATUS'

| SEM_3_STATUS | count |
|---|---|
| E | 21802 |
| N | 3461 |


First, inconsistent labels in the 'GENDER' column are fixed by converting all entries to a consistent case and removing leading/trailing spaces. Then, rare categories are combined or eliminated, such as the `Nonbinary` class in the `GENDER` feature. Finally, observing the `RACE_ETHNICITY` feature, it is decided to consolidate the 'Unknown', 'Native Hawaiian or Other Pacific Islander', and 'American Indian or Alaska Native' classes into one new 'Other' class.



Fix inconsistent labels in the **GENDER** feature by converting all entries to a consistent case and removing leading/trailing spaces.

In [None]:
df['GENDER'] = df['GENDER'].str.strip().str.capitalize()
display(df['GENDER'].value_counts())

| GENDER | count |
|---|---|
| Female | 15171 |
| Male | 10054 |
| Nonbinary | 38 |


Drop the rare Nonbinary class in **GENDER**:

In [None]:
# The label is 'Nonbinary' (no hyphen) after the cleanup above, so the filter must match it exactly
df = df[df['GENDER'] != 'Nonbinary']

For **RACE_ETHNICITY**, combine Unknown, Native Hawaiian or Other Pacific Islander, and American Indian or Alaska Native into the category Other:


In [None]:
df['RACE_ETHNICITY'] = df['RACE_ETHNICITY'].replace(['Unknown', 'Native Hawaiian or Other Pacific Islander', 'American Indian or Alaska Native'], 'Other')
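
The classes above were chosen by inspecting the value counts. As a hedged alternative, rare classes could also be identified programmatically by their share of rows. The helper `consolidate_rare` and its 2% cutoff are hypothetical and not part of the course code; the toy values are made up.

```python
import pandas as pd

def consolidate_rare(series: pd.Series, min_share: float = 0.02,
                     other_label: str = "Other") -> pd.Series:
    """Replace classes whose share of rows is below min_share with other_label."""
    shares = series.value_counts(normalize=True)
    rare_classes = shares[shares < min_share].index
    # Keep values that are not rare; swap the rare ones for the shared label
    return series.where(~series.isin(rare_classes), other_label)

# Toy data for illustration only
toy = pd.Series(
    ["Hispanic"] * 60 + ["Asian"] * 38
    + ["Unknown"] + ["American Indian or Alaska Native"]
)
result = consolidate_rare(toy)
print(result.value_counts())
```

A threshold-based rule is easy to rerun when new cohorts arrive, but the cutoff itself remains a judgment call that deserves the same human oversight as the manual choice.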

## Eliminating Duplicates: Ensuring Data Integrity for Machine Learning

Identifying and removing duplicate rows is a vital part of the data cleaning process in a machine learning workflow. Duplicate data can significantly impact the performance and reliability of your model in several ways, and addressing them early is crucial.

Here's why eliminating duplicate rows is necessary:

### 1. Skewed Model Training

When identical rows are present in your dataset, your machine learning model effectively sees the same information multiple times. This can lead to the model being overly influenced by the patterns present in the duplicated rows. The model might learn to predict based on the repeated instances rather than the underlying, unique patterns in the data. This can result in a model that performs well on the training data (because it has seen those examples repeatedly) but generalizes poorly to new, unseen data that doesn't contain the same duplications.

### 2. Inflated Evaluation Metrics

Duplicate rows can also artificially inflate your model's evaluation metrics. If duplicate data exists in both your training and testing sets (which can happen if you don't remove them before splitting), the model might correctly predict the outcome for a duplicated test instance simply because it learned that exact instance during training. This gives a false impression of the model's ability to generalize to novel data. Metrics like accuracy, precision, and recall can appear higher than they truly are.

### 3. Misleading Data Distribution

Duplicate rows distort the true distribution of your data. For example, if a specific type of observation is duplicated many times, it will appear more frequent in the dataset than it is in reality. This can mislead exploratory data analysis and influence decisions about feature engineering or model selection based on an inaccurate understanding of the data's characteristics.

### 4. Increased Training Time and Resource Usage

While less critical than model performance issues, duplicate rows also add unnecessary complexity to your dataset. Training a model on a larger dataset with duplicates takes more time and computational resources without adding valuable, unique information. Removing duplicates can lead to more efficient training.

By eliminating duplicate rows, you ensure that your model learns from a dataset where each observation represents unique information. This leads to a more accurate representation of the data's underlying patterns, prevents the model from being biased by repeated instances, and provides a more reliable evaluation of its performance on unseen data. It's a fundamental step towards building a robust and generalizable machine learning model.
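
Before dropping anything, it can help to look at which rows are flagged. The sketch below uses made-up values to contrast fully identical rows with rows that merely share an identifier; the toy DataFrame and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "SID": ["A1", "B2", "B2", "C3", "C3"],
    "GPA_1": [3.0, 2.5, 2.5, 3.8, 3.9],
})

# keep=False marks every member of a group of fully identical rows
exact_dupes = toy[toy.duplicated(keep=False)]

# IDs that repeat even when other columns differ (these are not exact duplicates)
repeated_ids = toy.loc[toy.duplicated(subset="SID", keep=False), "SID"].unique()

# drop_duplicates keeps the first copy of each fully identical row
deduped = toy.drop_duplicates()

print(len(exact_dupes), list(repeated_ids), len(deduped))
```

Here student C3 appears twice with different values, so `drop_duplicates()` keeps both rows. Whether such repeated IDs are errors is a separate question that exact-duplicate removal does not answer.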

Check for duplicate rows in the DataFrame.



In [None]:
df.duplicated().sum()

np.int64(65)

Drop duplicate rows from the DataFrame.



In [None]:
df.drop_duplicates(inplace=True)
display(df.duplicated().sum())

np.int64(0)

<a id="3"></a>
## 3 Data Splitting


As we observed in *Module 3: Explaining the Machine Learning Cycle Without Hyperparameter Tuning*, a learning algorithm is only useful to the extent that we can confidently apply it to unseen data to make accurate predictions. The ability to generalize is measured by evaluating model performance on a held-out random sample of the full data called the test set. Before we explore or analyze our data, it is imperative that we split it into a training set and a test set. This step will reintroduce us to Python's machine learning powerhouse, **[scikit-learn](https://scikit-learn.org/stable/index.html)**.

<a id="31"></a>
#### 3.1 Full Data to Training and Testing

Data splitting is one of the most important steps of the machine learning cycle. We've all had instructors who, let's just say, provided a lot of friendly *guidance* about what material would appear on an exam (they were pretty popular professors). Often this came in the form of a "practice exam". The result was that the exam was, for all intents and purposes, seen before exam day, and those who could memorize well were likely to achieve the most success. As much as stressed-out college students might enjoy it, this arrangement does not facilitate genuine learning, which is demonstrated by the ability to accurately generalize concepts and constructs to new scenarios. This is why we split data: instead of memorizing content and being tested on how well we can repeat it, we attempt to learn the "how" and "why" behind the data-generating process, so that when new data comes from that process, we can legitimately demonstrate a deep level of understanding. Splitting the data into a train set and a test set, and not using the test set at all to learn patterns in the data, enables our model to demonstrate this deeper understanding. Let's load the **train_test_split** function from the scikit-learn library and get our study on!

In [None]:
#Function for data splitting
from sklearn.model_selection import train_test_split

Figure 1 displays the first step of the data splitting process: identify and isolate the feature matrix \\((X)\\) and label vector \\((y)\\) in the context of an easy-to-visualize DataFrame. The figure is followed by the code that gets this process started.

**Figure 1:** Separating our curated DataFrame into a feature matrix \\((X)\\) and label vector \\((y)\\). An example with a DataFrame with 15 observations.


![ih](../public/figures/Xy_pic_2-3.png)

Next, let's create the feature matrix by removing the target and identifier variables.

In [None]:
#Creating the feature matrix: drop the identifiers (SID, COHORT) and the target (SEM_3_STATUS)
X = df.drop(['SID', 'COHORT', 'SEM_3_STATUS'], axis=1)

For the target variable, we need a column in which 1 represents students who were not enrolled in semester 3, and 0 represents students who were retained. Thus we need to **one hot encode** the "N" class in our target:

In [None]:
#The one hot encoding for the N class: 1 if SEM_3_STATUS is 'N', otherwise 0
y = (df['SEM_3_STATUS'] == 'N').astype(int)

The initial split was a vertical one, separating features from label. We proceed with a horizontal split, randomly holding out a specified percentage of observations for testing.

Let's create an 80-20 split of the data for training, and testing on an unlearned hold-out set. One of the most useful functions in scikit-learn, **[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**, gets the job done in one line of code:

In [None]:
rms = 42  # arbitrary seed (an assumption; the value originally assigned to rms is not shown)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=rms)

#The random_state argument makes the code reproducible: every time we run this code, the same observations will be allocated to the test set.

**Figure 2:** An example of an 80%-20% train-test split on a dataframe with 20 observations. Randomly sample 20% * 20 = 4 values to hold out for model testing: Observations 2,6,13 and 19.

![ih](https://github.com/ksuaray/IRML---Regression-and-Classification/blob/MLCert-Sketches/MLCert%20Sketches%202/80-20-Xy.png?raw=true)

Roughly 20% of our observations will be reserved for model testing; the shapes printed below give the exact counts. To prevent *data leakage*, they will not be part of our data exploration or model fitting whatsoever; we don't want to peek at the test before exam day, right?

In [None]:
print(X_train.shape,X_test.shape)

(8196, 13) (2049, 13)
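
Because the target classes are imbalanced, a purely random split can leave the test set with a slightly different share of the minority class than the training set. One option, not used in the split above, is the `stratify` argument of `train_test_split`. The sketch below uses made-up data with a 15% minority class; the variable names and seed are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 100 rows, 15% of them in class 1
toy_X = pd.DataFrame({"GPA_1": [i / 10 for i in range(100)]})
toy_y = pd.Series([1] * 15 + [0] * 85)

# stratify=toy_y keeps the class proportions (approximately) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=0, stratify=toy_y
)

print(y_tr.mean(), y_te.mean())
```

Both printed proportions match the overall minority share, which makes evaluation metrics on the test set easier to compare with those on the training set.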


### 3.2 Cohort-Based Splitting

- Imagine you're trying to predict whether students in the **Fall 2022** cohort will drop out — but your model has only seen data from earlier semesters.

- This is a **real-world scenario**: using historical data to make future predictions.

- Instead of using a random mix of students, we deliberately separate the latest cohort (Fall 2022) to simulate **how well our model performs on new, unseen students**.

- This is known as **cohort-based splitting**, and it's a more realistic evaluation when time or group differences matter.


In [None]:
# Let's separate the test set to only include students from 'Fall 2022' cohort
df_test_cohort = df[df['COHORT'] == 'Fall 2022'].copy()

# The training set will include all students from earlier cohorts
df_train_cohort = df[df['COHORT'] != 'Fall 2022'].copy()

# Show shapes of both datasets
print(f"Training Data Shape (Cohort-Based): {df_train_cohort.shape}")
print(f"Testing Data Shape (Fall 2022 Cohort): {df_test_cohort.shape}")


In this case, the training set includes all students **except** those in the Fall 2022 cohort. The test set contains **only** Fall 2022 students.

This allows us to test how well the model trained on past students performs on a **completely new group**. It's like training a tutor on last year's students and then seeing how well they guide this year's students.
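
A cohort-based split still needs the same vertical separation into features and target as the random split. As a minimal sketch on made-up rows, a small helper can apply that separation to both parts; the helper name `split_features_target` and the toy values are hypothetical.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "SID": ["A1", "B2", "C3", "D4"],
    "COHORT": ["Fall 2021", "Fall 2021", "Fall 2022", "Fall 2022"],
    "GPA_1": [3.2, 1.9, 3.6, 2.4],
    "SEM_3_STATUS": ["E", "N", "E", "N"],
})

def split_features_target(frame: pd.DataFrame):
    """Return (X, y): features without identifiers, and 1 where SEM_3_STATUS is 'N'."""
    X = frame.drop(columns=["SID", "COHORT", "SEM_3_STATUS"])
    y = (frame["SEM_3_STATUS"] == "N").astype(int)
    return X, y

# Hold out the latest cohort, train on everything earlier
train_part = toy[toy["COHORT"] != "Fall 2022"]
test_part = toy[toy["COHORT"] == "Fall 2022"]

X_train_c, y_train_c = split_features_target(train_part)
X_test_c, y_test_c = split_features_target(test_part)
print(X_train_c.shape, X_test_c.shape)
```

Using one helper for both parts guarantees that the training and testing feature matrices end up with identical columns.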


In [None]:
# Let’s preview 3 rows from each set to understand the data better
print("Training Data Sample (Before Fall 2022):")
display(df_train_cohort.head(3))

print("\nTesting Data Sample (Only Fall 2022):")
display(df_test_cohort.head(3))


### 3.3 Method Comparison

**Method 1: Random Split (train_test_split)**

Pros:

- Ensures that both the training and testing sets are representative of the overall data distribution.
- Simple to implement and a standard practice in machine learning.
- Avoids potential biases that could arise from non-random splits.

Cons:

- May not be ideal if one needs to evaluate the model's performance on a specific, future cohort.
- If there are significant differences between cohorts, a randomly split test set might not reflect real-world performance.

**Method 2: Cohort-Based Split (COHORT == 'Fall 2022')**

Pros:

- Provides a realistic evaluation of how the model would perform on a specific group, such as the most recent cohort.
- Allows you to assess the model's ability to generalize to a cohort that may have different characteristics.

Cons:

- The test set may not be representative of the overall data distribution if the chosen cohort is significantly different.
- If the chosen test cohort is significantly different from the training cohorts, the model's performance might appear worse than it actually is.
- Reduces the size of the training data, which could impact model performance, especially for smaller datasets.

**When to use which method:**

- Use random splitting when you want to build a model that generalizes well to new data from the same population.
- Use cohort-based splitting when you need to specifically evaluate your model's performance on a particular group or time period.

## 4 Data Imputation



After splitting the data into training and testing sets, we need to make sure the datasets are **clean and complete** before we feed them into a machine learning model. Real-world data often contains **missing values** — cells that are empty or labeled as NaN (`Not a Number`). These can cause problems during training because most algorithms can’t handle missing values out-of-the-box.

Let’s walk through how to handle (or "impute") these missing values in a smart and consistent way.


### 4.1 Identifying Missing Data

This code helps us find out which columns in the training data are missing values and how many values are missing in each.
Identifying where the missing data exists is the first step before deciding how to fix it.


In [None]:
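# Assumption: df_train and df_test are copies of the random-split feature sets from Section 3.1
df_train, df_test = X_train.copy(), X_test.copy()

# Look for columns in df_train that have missing values
missing_train = df_train.isnull().sum()
cols_with_missing_train = missing_train[missing_train > 0].index
display(cols_with_missing_train)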


### 4.2 Imputing Missing Values in the Training Set

- If a column contains numerical data, we fill the missing values using the median. This prevents extreme values (outliers) from skewing the imputation.

- If a column contains categorical data (like labels or categories), we fill missing values using the mode (the most common value).

We compute these values only from the training set. This is important because using test data during training can introduce bias (known as data leakage).

In [None]:
# For each column with missing values in df_train
# (cols_with_missing_train is already an Index of column names, so we iterate over it directly):
for col in cols_with_missing_train:
    if pd.api.types.is_numeric_dtype(df_train[col]):
        # For numerical columns, fill missing values with the column's median
        df_train[col] = df_train[col].fillna(df_train[col].median())
    else:
        # For categorical columns, fill missing values with the most frequent value (mode)
        df_train[col] = df_train[col].fillna(df_train[col].mode()[0])


### 4.3 Imputing Missing Values in the Test Set

When imputing the test set, we do not calculate new statistics.
Instead, we reuse the same median and mode values from the training set. This simulates a real-world deployment, where the model sees only new data but relies on patterns learned from the past.

In [None]:
# Now do the same for df_test, but use training data statistics
missing_test = df_test.isnull().sum()
cols_with_missing_test = missing_test[missing_test > 0].index

# Iterate over the Index of column names directly
for col in cols_with_missing_test:
    if pd.api.types.is_numeric_dtype(df_train[col]):
        # Median comes from the training set, not the test set
        df_test[col] = df_test[col].fillna(df_train[col].median())
    else:
        # Mode comes from the training set, not the test set
        df_test[col] = df_test[col].fillna(df_train[col].mode()[0])


### 4.4 Imputing Missing Values in Cohort-Based Splits

Now we apply the same imputation process to our **cohort-based training and testing datasets**. This ensures both random-split and cohort-split versions of the data are clean and consistent before we proceed.

We use the **same function** to maintain consistency and reusability.
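
The helper `impute_missing_values` is called below but is not defined in this section. A minimal sketch of what it might look like follows, assuming it simply wraps the median and mode logic from Sections 4.2 and 4.3 and that both DataFrames share the same columns. The body is an assumption, not the course's actual implementation, and the toy frames are made up.

```python
import numpy as np
import pandas as pd

def impute_missing_values(train: pd.DataFrame, test: pd.DataFrame):
    """Hypothetical helper: fill NaNs in both frames using training-set statistics only."""
    train, test = train.copy(), test.copy()
    for col in train.columns:
        # Skip columns that are complete in both frames
        if not (train[col].isnull().any() or test[col].isnull().any()):
            continue
        if pd.api.types.is_numeric_dtype(train[col]):
            fill_value = train[col].median()   # numeric: training median
        else:
            fill_value = train[col].mode()[0]  # categorical: training mode
        train[col] = train[col].fillna(fill_value)
        test[col] = test[col].fillna(fill_value)
    return train, test

# Toy data for illustration only
toy_train = pd.DataFrame({"GPA_1": [2.0, np.nan, 4.0],
                          "GENDER": ["Female", "Female", None]})
toy_test = pd.DataFrame({"GPA_1": [np.nan], "GENDER": [None]})

filled_train, filled_test = impute_missing_values(toy_train, toy_test)
print(filled_test)
```

The function returns new DataFrames rather than modifying its inputs, so the original splits stay available for comparison.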

In [None]:
df_train_cohort, df_test_cohort = impute_missing_values(df_train_cohort, df_test_cohort)
