# 2.3. Get Your Data Machine Learning Ready for Classification: Data Wrangling

## Preparing the Data

### **Table of Contents**  

- [1. Data Import](#1)  
- [2. Data Wrangling](#2)
  - [2.1 Response Variable Distribution](#21)
  - [2.2 Rare Classes in Features](#22)
  - [2.3 Noninformative features](#23)
- [3. Data Splitting](#3)
  - [3.1 Full Data to Training and Testing](#31)
  - [3.2 Training to Build and Validation](#32)
- [4. Addressing Class Imbalance](#4)

<a id="1"></a>
## 1 Data Import

In Course 1, Module 3: *Magic Pandas Library: Mastering Higher Education Data Preparation and Analysis*, we learned how to merge data that originated from multiple sources across campus. The High School, Enrollment, Admissions, Course and Completion datasets all provide valuable information to assist us in our effort to predict student metrics in future semesters. As you recall, we've selected a subset of the variables from these data to include in the modeling phase. These include:
1. Academic Performance Data
      - Available at time of admission: high school GPAs
      - Available at time of modeling: units attempted, completed and DFW, and available postsecondary GPAs
2. Demographic Data
      - Gender, ethnicity, first gen status

3. The target variable, **SEM_3_STATUS**, a qualitative variable coded as follows:

| Code | Meaning |
|---|---|
|E |Enrolled |
|N |Not Enrolled |
|G |Graduated |


Let's load the necessary Python libraries to import the data and start to process it for analysis:

In [None]:
import pandas as pd
import warnings

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

Now let's import the **ml_data** data we've curated. Then, by calling `head()` on the name we assign it, we can scope out the first 5 rows of the DataFrame and view its basic attributes in detail:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [12]:
file_path = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/student_academics_data.csv'

df = pd.read_csv(file_path)
display(df.head())

Unnamed: 0,SID,COHORT,RACE_ETHNICITY,GENDER,FIRST_GEN_STATUS,FAMILY_INCOME,HS_GPA,HS_MATH_GPA,HS_ENGL_GPA,COLLEGE,...,UNITS_COMPLETED_1,UNITS_COMPLETED_2,DFW_UNITS_1,DFW_UNITS_2,GPA_1,GPA_2,GPA_3,SEM_1_STATUS,SEM_2_STATUS,SEM_3_STATUS
0,UHDOP5522,Fall 2020,Asian,Female,Continuing Generation,,3.72,3.2,3.4,Visual & Performing Arts,...,15.0,15.0,0.0,0.0,4.0,3.785714,4.0,E,E,E
1,UHE842CU6,Fall 2021,Black or African American,Female,Continuing Generation,,3.189,2.6,3.75,Visual & Performing Arts,...,12.0,12.0,3.0,4.0,3.0,2.5,1.5,E,E,E
2,UHJFT1JAB,Fall 2018,Asian,Female,Continuing Generation,,3.625,3.4,3.5,Visual & Performing Arts,...,15.0,16.0,0.0,0.0,3.8,3.6,3.6,E,E,E
3,UHKF05TAF,Fall 2018,Hispanic,Female,First Generation,,3.606,3.0,3.375,Letters & Humanities,...,7.0,3.0,9.0,9.0,1.5625,1.0,2.5,E,E,E
4,UHKKQ8UY5,Fall 2021,Hispanic,Male,Continuing Generation,50K<,3.536,2.5,2.625,Letters & Humanities,...,13.0,13.0,0.0,0.0,3.538462,3.769231,3.4,E,E,E


<a id="2"></a>
## 2 Data Wrangling

Data availability is a necessary condition for data analysis, but it is not sufficient. There are a number of modifications we need to make to the data to prepare it for machine learning. The process of preparing the data for exploration and modeling is known as **data wrangling**, and will be performed here.
To answer Shontelle's question, we need to build a model using cohorts for which term 3 data has already been collected. Thus our response variable will be based on the SEM_3_STATUS variable. Let's dig deeper.

<a id="21"></a>
#### 2.1 Response Variable Distribution

Recall that this DataFrame consists of several cohorts; the output below shows five, Fall 2018 through Fall 2022. The cohort sizes may be identified as follows:

In [13]:
df['COHORT'].value_counts()

Unnamed: 0_level_0,count
COHORT,Unnamed: 1_level_1
Fall 2022,5363
Fall 2019,5170
Fall 2018,4954
Fall 2020,4910
Fall 2021,4866


The code below groups the DataFrame by the 'COHORT' and 'SEM_3_STATUS' columns, counts the number of occurrences of each combination, and resets the index, naming the count column 'COUNTS'.


In [14]:
df[['COHORT', 'SEM_3_STATUS']].groupby(['COHORT', 'SEM_3_STATUS']).size().reset_index(name='COUNTS')

Unnamed: 0,COHORT,SEM_3_STATUS,COUNTS
0,Fall 2018,E,4307
1,Fall 2018,N,647
2,Fall 2019,E,4583
3,Fall 2019,N,587
4,Fall 2020,E,4239
5,Fall 2020,N,671
6,Fall 2021,E,4133
7,Fall 2021,N,733
8,Fall 2022,E,4540
9,Fall 2022,N,823
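
Raw counts are easier to compare across cohorts once they are converted to within-cohort proportions. In the table above, the not-enrolled share works out to roughly 11% to 15% per cohort. The sketch below shows one way to compute such shares with `pd.crosstab`; the small `toy` DataFrame is a made-up stand-in for the `COHORT` and `SEM_3_STATUS` columns, not the course data.

```python
import pandas as pd

# Toy stand-in for the COHORT and SEM_3_STATUS columns (values are illustrative only)
toy = pd.DataFrame({
    'COHORT': ['Fall 2018'] * 4 + ['Fall 2019'] * 4,
    'SEM_3_STATUS': ['E', 'E', 'E', 'N', 'E', 'E', 'N', 'N'],
})

# normalize='index' divides each row by its total, so every cohort's shares sum to 1
status_rates = pd.crosstab(toy['COHORT'], toy['SEM_3_STATUS'], normalize='index')
print(status_rates)
```

Looking at shares rather than counts makes any imbalance between the two target classes visible at a glance, whatever the cohort size.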


We observe the distribution of 'SEM_2_STATUS' and determine that for Fall 2023 only the value 'NR' is observed. The Fall 2023 cohort is therefore "unseasoned": not enough time has elapsed since the Fall of 2023 for students to continue or graduate. Consequently, we should use the Fall 2021 and Fall 2022 cohorts to train our machine learning models, and then use the Fall 2023 cohort to make predictions.

We may proceed by investigating data quality issues in our DataFrame that could affect our analysis. These include:

 - Missing values in features
 - Rare classes in features
 - Noninformative features
 - Class imbalance in target


# Drop unnecessary columns outside of analysis scope

Since the scope of the analysis is to predict dropout at the 3rd semester, we drop `SEM_1_STATUS` and `SEM_2_STATUS`.

In [15]:
df.drop(['SEM_1_STATUS', 'SEM_2_STATUS'], axis=1, inplace=True)

<a id="4"></a>
# Addressing Missingness

As mentioned previously, an essential data preprocessing step for modeling in scikit-learn is accounting for missingness in our observations. Most scikit-learn estimators will not run with missing data, so we need to decide how to deal with it.
Let's investigate missingness in our dataset, and use that to determine the most effective way to proceed.

To check for missing values in a pandas DataFrame, we can use the command `df.isnull().sum()`. The output of this command shows a large number of missing values in our data. This is expected, as high school data is not available for some students and family income is missing for most. While missing data can sometimes be ignored during exploratory data analysis, it must be addressed for predictive modeling with libraries like statsmodels and scikit-learn, whose models generally require complete data.



In [16]:
display(df.isnull().sum())

Unnamed: 0,0
SID,0
COHORT,0
RACE_ETHNICITY,0
GENDER,0
FIRST_GEN_STATUS,0
FAMILY_INCOME,20967
HS_GPA,124
HS_MATH_GPA,359
HS_ENGL_GPA,359
COLLEGE,0


Before proceeding, we need to decide how to handle these missing values. We have three main options:

1.  **Remove observations:** Delete all rows that contain any missing values.
2.  **Impute values:** Fill in missing values with estimated or plausible values.
3.  **Exclude variables:** Remove entire columns that have missing values from the analysis.

For the current analysis, we choose option 1: drop incomplete observations. Our goal is to create a model that utilizes data that is available for the typical domestic applicant.
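
To make the trade-off among the three options concrete, here is a minimal sketch on a made-up four-row DataFrame. The column names mirror ours, but the values are invented, and median imputation is just one assumed choice for option 2.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing high school values (illustrative only)
toy = pd.DataFrame({
    'HS_GPA': [3.2, np.nan, 3.8, 2.9],
    'HS_MATH_GPA': [3.0, 2.5, np.nan, 3.1],
    'GPA_1': [3.5, 2.0, 3.9, 2.7],
})

# Option 1: remove every row that contains a missing value
rows_dropped = toy.dropna(axis=0)

# Option 2: impute missing values, here with each column's median (one of many strategies)
imputed = toy.fillna(toy.median())

# Option 3: exclude every column that contains a missing value
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape, imputed.shape, cols_dropped.shape)
```

Option 1 keeps all variables but loses rows, option 3 keeps all rows but loses variables, and option 2 keeps both at the cost of introducing estimated values.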

Several hundred incomplete observations is a lot of data to drop, but it is a necessary step if we want to incorporate high school data in our model and thus avoid option 3 above. To whatever extent possible, we should use domain knowledge or critical investigation to ascertain *why* data are missing, as this has major implications for model bias and generalizability. The primary framework for understanding missingness has three possibilities:

  1. MCAR - Missing Completely at Random - reasons for missingness are unrelated to any variables, observed or unobserved

  2. MAR - Missing at Random - reasons for missingness in a specific variable are unrelated to that variable's own values, and are instead explained by some other observed variable

  3. MNAR - Missing Not at Random - reasons for missingness in a specific variable are related directly to that variable's own (unobserved) values
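
One hedged way to probe which mechanism is plausible is to compare the missingness rate of a variable across the levels of an observed variable. Large differences between groups argue against MCAR and are consistent with MAR. MNAR generally cannot be confirmed from the observed data alone. The sketch below uses invented values, and the choice of `RACE_ETHNICITY` as the grouping variable is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: HS_GPA is missing more often in some groups than others (illustrative only)
toy = pd.DataFrame({
    'RACE_ETHNICITY': ['Visa Non-U.S.', 'Visa Non-U.S.', 'Asian', 'Asian', 'White', 'White'],
    'HS_GPA': [np.nan, np.nan, 3.4, 3.9, np.nan, 3.1],
})

# Share of missing HS_GPA values within each group, highest first
missing_rate = (
    toy['HS_GPA'].isnull()
    .groupby(toy['RACE_ETHNICITY'])
    .mean()
    .sort_values(ascending=False)
)
print(missing_rate)
```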

In this scenario, it is most likely that high school data is missing for observations corresponding to international students, students who were homeschooled, or students who attended a private school. As such, if we restrict our population of interest (and thus the scope of our model implementation) to exclude these groups, bias is mitigated when we drop observations with missing data.
In addition, in anticipation of our inclusion of DFW rate, let's remove any observations with 0 units attempted in terms 1 and 2.
To enable use in our model, we'll need to do the same with the test data (without explicitly viewing it, of course). Let's take a look at the complete training data:
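
A minimal sketch of the row filters described above, on invented data. Two things here are assumptions rather than the notebook's confirmed method: rows are dropped when units attempted is 0 in *either* term, and the DFW rate is taken to be DFW units divided by units attempted, stored in hypothetical `DFW_RATE_1` and `DFW_RATE_2` columns.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    'HS_GPA': [3.7, np.nan, 3.1, 3.4],
    'UNITS_ATTEMPTED_1': [15.0, 12.0, 0.0, 16.0],
    'UNITS_ATTEMPTED_2': [14.0, 12.0, 13.0, 9.0],
    'DFW_UNITS_1': [0.0, 3.0, 0.0, 9.0],
    'DFW_UNITS_2': [0.0, 4.0, 0.0, 9.0],
})

# Drop rows with missing high school data
complete = toy.dropna(subset=['HS_GPA'])

# Drop rows with 0 units attempted in either term, so the rate below never divides by zero
has_units = (complete['UNITS_ATTEMPTED_1'] > 0) & (complete['UNITS_ATTEMPTED_2'] > 0)
complete = complete[has_units].copy()

# Assumed definition of DFW rate: DFW units as a share of units attempted
for term in (1, 2):
    complete[f'DFW_RATE_{term}'] = (
        complete[f'DFW_UNITS_{term}'] / complete[f'UNITS_ATTEMPTED_{term}']
    )

print(complete)
```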

Next, we check the number of missing values in each column, identify columns with more than 50% missing values, and drop them from the DataFrame.


In [9]:
missing_values_count = df.isnull().sum()
total_rows = len(df)
columns_to_drop = missing_values_count[missing_values_count / total_rows > 0.5].index.tolist()
df.drop(columns=columns_to_drop, inplace=True)
display(f"Number of remaining columns: {df.shape[1]}")
display(df.head())

'Number of remaining columns: 19'

Unnamed: 0,SID,COHORT,RACE_ETHNICITY,GENDER,FIRST_GEN_STATUS,HS_GPA,HS_MATH_GPA,HS_ENGL_GPA,COLLEGE,UNITS_ATTEMPTED_1,UNITS_ATTEMPTED_2,UNITS_COMPLETED_1,UNITS_COMPLETED_2,DFW_UNITS_1,DFW_UNITS_2,GPA_1,GPA_2,GPA_3,SEM_3_STATUS
0,UHDOP5522,Fall 2020,Asian,Female,Continuing Generation,3.72,3.2,3.4,Visual & Performing Arts,15.0,14.0,15.0,15.0,0.0,0.0,4.0,3.785714,4.0,E
1,UHE842CU6,Fall 2021,Black or African American,Female,Continuing Generation,3.189,2.6,3.75,Visual & Performing Arts,12.0,12.0,12.0,12.0,3.0,4.0,3.0,2.5,1.5,E
2,UHJFT1JAB,Fall 2018,Asian,Female,Continuing Generation,3.625,3.4,3.5,Visual & Performing Arts,15.0,15.0,15.0,16.0,0.0,0.0,3.8,3.6,3.6,E
3,UHKF05TAF,Fall 2018,Hispanic,Female,First Generation,3.606,3.0,3.375,Letters & Humanities,16.0,9.0,7.0,3.0,9.0,9.0,1.5625,1.0,2.5,E
4,UHKKQ8UY5,Fall 2021,Hispanic,Male,Continuing Generation,3.536,2.5,2.625,Letters & Humanities,13.0,13.0,13.0,13.0,0.0,0.0,3.538462,3.769231,3.4,E


<a id="22"></a>
#### 2.2 Rare Classes in Features

Let's take a look at the distribution of values in our qualitative variables. If it turns out that there are some values that are rare, they could cause issues with our downstream data processing. One way to avoid this is to consolidate rare classes into one. Note that consolidating or dropping classes or variables is not a reflection of their importance or relevance to the analysis; instead, it highlights one of the limitations of machine learning and the importance of human oversight in creating a legitimate representation of the truth.

Let's investigate the class distribution for **RACE_ETHNICITY** and consolidate rare occurrences into an 'Other' class:

In [None]:
#Distribution of classes in RACE_ETHNICITY
retention['RACE_ETHNICITY'].value_counts()

RACE_ETHNICITY
Hispanic/Latino                              8407
Asian                                        3510
White                                        2030
Two or More Races                             735
Black or African American                     608
Visa Non-U.S.                                 480
Unknown                                       206
Native Hawaiian or Other Pacific Islander      34
American Indian or Alaska Native               15
Name: count, dtype: int64

Let's consolidate the Unknown, Native Hawaiian or Other Pacific Islander, and American Indian or Alaska Native classes into one new 'Other' class:

In [None]:
#Consolidating the three smallest classes into one 'Other' class
rare_classes = ['American Indian or Alaska Native',
                'Native Hawaiian or Other Pacific Islander',
                'Unknown']

# Update 'RACE_ETHNICITY' to 'Other' for rows belonging to a rare class
retention.loc[retention['RACE_ETHNICITY'].isin(rare_classes), 'RACE_ETHNICITY'] = 'Other'

retention['RACE_ETHNICITY'].value_counts()

RACE_ETHNICITY
Hispanic/Latino              8407
Asian                        3510
White                        2030
Two or More Races             735
Black or African American     608
Visa Non-U.S.                 480
Other                         255
Name: count, dtype: int64
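
When several categorical features need the same treatment, listing class names by hand gets tedious. The sketch below shows a threshold-based alternative. The helper `consolidate_rare` and its `min_count` parameter are hypothetical names introduced here for illustration, not part of pandas. With the counts shown above, any `min_count` from 207 to 480 would select the same three classes we consolidated manually.

```python
import pandas as pd

def consolidate_rare(series, min_count, other_label='Other'):
    """Replace every category that appears fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with other_label
    return series.where(~series.isin(rare), other_label)

# Toy categorical data: 'C' and 'D' are rare (illustrative only)
toy = pd.Series(['A'] * 5 + ['B'] * 4 + ['C'] * 1 + ['D'] * 2)
consolidated = consolidate_rare(toy, min_count=3)
print(consolidated.value_counts())
```

The threshold itself remains a judgment call that should be informed by domain knowledge, as with the manual approach.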

Investigating **FIRST_GEN_STATUS**, it is clear that there are no rare classes:

In [None]:
#Distribution of classes in FIRST_GEN_STATUS
retention['FIRST_GEN_STATUS'].value_counts()

FIRST_GEN_STATUS
Continuing Generation    9926
First Generation         4653
Unknown                  1446
Name: count, dtype: int64

Finally, for **GENDER**, we drop the rare Non-binary class:

In [None]:
#Distribution of classes in GENDER
retention['GENDER'].value_counts()

GENDER
Female       9778
Male         6192
Nonbinary      55
Name: count, dtype: int64

In [None]:
retention = retention[(retention['GENDER']!='Nonbinary')]

In [None]:
retention['GENDER'].value_counts()

GENDER
Female    9778
Male      6192
Name: count, dtype: int64

<a id="23"></a>
#### 2.3 Noninformative features

Next, let's further refine the retention DataFrame by removing collinear and unobservable variables (the identifier and target columns are removed later, when we create the feature matrix). We typically use the `.drop` method with column names, but due to the large number of variables we'll drop, let's use indices to select columns. First let's identify the positional index of each variable:

In [None]:
for i,j in enumerate(retention.columns):
    print(i,j)

0 SID
1 COHORT
2 RACE_ETHNICITY
3 GENDER
4 FIRST_GEN_STATUS
5 HS_GPA
6 HS_MATH_GPA
7 HS_ENGL_GPA
8 COLLEGE
9 UNITS_ATTEMPTED_1
10 UNITS_ATTEMPTED_2
11 UNITS_ATTEMPTED_3
12 UNITS_ATTEMPTED_4
13 UNITS_COMPLETED_1
14 UNITS_COMPLETED_2
15 UNITS_COMPLETED_3
16 UNITS_COMPLETED_4
17 DFW_UNITS_1
18 DFW_UNITS_2
19 DFW_UNITS_3
20 DFW_UNITS_4
21 GPA_1
22 GPA_2
23 GPA_3
24 GPA_4
25 CUM_GPA_1
26 CUM_GPA_2
27 CUM_GPA_3
28 CUM_GPA_4
29 SEM_1_STATUS
30 SEM_2_STATUS
31 SEM_3_STATUS
32 SEM_4_STATUS
33 SEM_5_STATUS
34 SEM_6_STATUS
35 SEM_7_STATUS
36 SEM_8_STATUS


Now we'll refer to this list to drop the variables.

In [None]:
ret_columns_to_drop = [5,11,12] + [15,16] + list(range(17,21)) + list(range(23,29)) + list(range(31,37))

retention_copy = retention.copy()

retention2 = retention_copy.drop(retention.columns[ret_columns_to_drop], axis=1)

retention2.reset_index(inplace=True,drop=True)

retention2.head()

Unnamed: 0,SID,COHORT,RACE_ETHNICITY,GENDER,FIRST_GEN_STATUS,HS_MATH_GPA,HS_ENGL_GPA,COLLEGE,UNITS_ATTEMPTED_1,UNITS_ATTEMPTED_2,UNITS_COMPLETED_1,UNITS_COMPLETED_2,GPA_1,GPA_2,SEM_1_STATUS,SEM_2_STATUS
0,JHPSY555D,Fall 2023,Hispanic/Latino,Female,Unknown,3.97,4.09,Business,12.0,,6.0,,1.666667,,NR,NR
1,9KC4NM2YV,Fall 2023,Hispanic/Latino,Male,First Generation,3.67,3.77,Arts,15.0,12.0,6.0,0.0,2.0,0.0,C,NR
2,33M8O2J01,Fall 2023,Hispanic/Latino,Male,First Generation,2.78,3.19,University Programs,13.0,,10.0,,2.0,,NR,NR
3,AMX4WP4A0,Fall 2023,Hispanic/Latino,Female,Continuing Generation,4.02,4.15,Science,13.0,15.0,13.0,11.0,2.615385,2.266667,C,C
4,R32ET2VTA,Fall 2023,Hispanic/Latino,Male,Continuing Generation,3.08,3.58,University Programs,6.0,6.0,3.0,3.0,1.0,1.5,C,NR
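
Positional indices are compact, but they silently select the wrong columns if the column order ever changes. As a hedged alternative, the sketch below rebuilds the column list printed above on an empty DataFrame and selects the same columns by name pattern. The helper `term_number` is a hypothetical name introduced here, and the rule it encodes (drop `HS_GPA`, all `DFW_UNITS_*` and `CUM_GPA_*` columns, and anything from term 3 onward) is inferred from the indices used above.

```python
import re
import pandas as pd

# Rebuild the 37 column names listed above on an empty DataFrame
prefixes = ['UNITS_ATTEMPTED', 'UNITS_COMPLETED', 'DFW_UNITS', 'GPA', 'CUM_GPA']
columns = (['SID', 'COHORT', 'RACE_ETHNICITY', 'GENDER', 'FIRST_GEN_STATUS',
            'HS_GPA', 'HS_MATH_GPA', 'HS_ENGL_GPA', 'COLLEGE']
           + [f'{p}_{t}' for p in prefixes for t in range(1, 5)]
           + [f'SEM_{t}_STATUS' for t in range(1, 9)])
toy = pd.DataFrame(columns=columns)

def term_number(name):
    """Return the term number embedded in a column name, or None if there is none."""
    match = re.search(r'_(\d)(?:_STATUS)?$', name)
    return int(match.group(1)) if match else None

# Select columns to drop by name rather than by position
to_drop = [c for c in toy.columns
           if c == 'HS_GPA'
           or c.startswith(('DFW_UNITS_', 'CUM_GPA_'))
           or (term_number(c) or 0) >= 3]

kept = toy.drop(columns=to_drop)
print(list(kept.columns))
```

Under these assumptions the name-based rule leaves the same 16 columns as the index-based selection.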


Removing minority classes and noninformative variables is an aspect of defining the entire data set that is integral to the machine learning process. In contrast, our next two challenges should only be addressed *after* we separate our training data from the predict data, and then handled differently for each split thereafter. They will be interspersed within our Data Splitting process.

This code separates the retention data into a training set and a prediction set. The training set includes data from the Fall 2021 and Fall 2022 cohorts using the condition `retention2["COHORT"].isin(["Fall 2021", "Fall 2022"])`. The prediction set includes data from the Fall 2023 cohort using the condition `retention2["COHORT"].isin(["Fall 2023"])`. The prediction set then drops the `SEM_2_STATUS` column, which serves as the target when we build the label vector below; the later semester statuses (SEM_3_STATUS to SEM_8_STATUS) were already removed in the previous step.

In [None]:
training = retention2[retention2["COHORT"].isin(["Fall 2021", "Fall 2022"])]
predict = retention2[retention2["COHORT"].isin(["Fall 2023"])]

#Note SEM_2_STATUS is the target we want to predict for this cohort, so we drop it from the prediction set
predict = predict.drop(columns=['SEM_2_STATUS'])

The `training` set will be our primary training DataFrame for analysis. It consists of 10,245 observations, uniquely identified by SID (as well as their row index in the **retention2** DataFrame), and 16 columns.

In [None]:
print(f'training set no. of rows {training.shape[0]}\n')
print(f'training set no. of columns {training.shape[1]}')

training set no. of rows 10245

training set no. of columns 16


We will save the processed `training` and `predict` datasets for use in other exercises and modules.

In [None]:
#training.to_csv('/Workspace/ira-ml-cert/data/training.csv', index=False)
#predict.to_csv('/Workspace/ira-ml-cert/data/predict.csv', index=False)

As we observed in *Module 3: Explaining the Machine Learning Cycle Without Hyperparameter Tuning*, a learning algorithm is only useful to the extent that we can confidently apply it to unseen data to make accurate predictions. The ability to generalize is measured by investigating model performance on a random sample of the full data called the test set. Before we explore or analyze our data, it is imperative that we split it into a training and test set. This step will reintroduce us to Python's machine learning powerhouse, **[scikit-learn](https://scikit-learn.org/stable/index.html)**.

<a id="3"></a>
## 3 Data Splitting


<a id="31"></a>
#### 3.1 Full Data to Training and Testing

Data splitting is one of the most important steps of the machine learning cycle. We've all had instructors that, let's just say, provided a lot of friendly *guidance* for what material would appear on an exam (they were pretty popular professors). Often this was in the form of a "practice exam". This led to a scenario where the exam was for all intents and purposes observed before exam day, and those who could memorize well were likely to achieve the most success. As much as stressed out college students might enjoy it, this arrangement does not facilitate genuine learning, which is demonstrated by the ability to accurately generalize concepts and constructs to new scenarios.

This is why we split data. Instead of memorizing content and being tested on how well we can repeat it, we are attempting to learn the "how" and "why" behind the data generating process, so that when new data comes from the process, we can legitimately demonstrate a deep level of understanding. Splitting the data into a training set and a test set, and not using the test set at all to learn patterns in the data, will enable our model to demonstrate this deeper understanding. Let's load the **train_test_split** function from the scikit-learn library and get our study on!

In [None]:
#Function for data splitting
from sklearn.model_selection import train_test_split

Figure 1 displays the first step of the data splitting process: identify and isolate the feature matrix \\((X)\\) and label vector \\((y)\\) in the context of an easy to visualize DataFrame. The figure is followed by the code that gets this process started.

**Figure 1:** Separating our curated DataFrame into a feature matrix \\((X)\\) and label vector \\((y)\\). An example with a DataFrame with 15 observations.


![ih](../public/figures/Xy_pic_2-3.png)

Next, let's create the feature matrix by removing the target and identifier variables.

In [None]:
#Creating the feature matrix
X = training.drop(['SID','COHORT','SEM_2_STATUS'],axis=1)

For the target variable, we need a column in which 1 represents students who leave in semester 3, and 0 represents students who were retained. Thus we need to **one hot encode** the "NR" class in our target, producing a single binary indicator:

In [None]:
#The one hot encoding for the NR class
y = (training['SEM_2_STATUS'] == 'NR').astype(int)

The initial split was a vertical one, separating features from label. We proceed with a horizontal split, randomly holding out a specified percentage of observations for testing.

Let's create an 80-20 split of the data for training, and testing on an unseen hold-out set. One of the most useful functions in scikit-learn, **[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)** gets the job done in one line of code:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=rms)

#The random_state argument makes the code reproducible - every time we run this code, the same observations will be allocated to the test set.

**Figure 2:** An example of an 80%-20% train-test split on a dataframe with 20 observations. Randomly sample 20% * 20 = 4 values to hold out for model testing: Observations 2,6,13 and 19.

![ih](https://github.com/ksuaray/IRML---Regression-and-Classification/blob/MLCert-Sketches/MLCert%20Sketches%202/80-20-Xy.png?raw=true)

From our original 10,245 observations, 20% \\(\times\\) 10,245 \\(=\\) 2,049 will be reserved for model testing. To prevent *data leakage*, they will not be part of our data exploration or model fitting whatsoever; we don't want to peek at the test before exam day, right?

In [None]:
print(X_train.shape,X_test.shape)

(8196, 13) (2049, 13)
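
Because the target is imbalanced, a purely random split can leave the test set with a somewhat different share of the minority class than the training set. The split above does not stratify. As an optional variation, `train_test_split` accepts a `stratify` argument that preserves class proportions in both parts. The sketch below uses made-up labels with 15% positives, and the seed value 34 is the same one this notebook assigns to `rms`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 150 positives out of 1,000 (illustrative only)
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy keeps the share of each class equal in the two parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=34, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification is appropriate here is a design choice; it matters most when the minority class is small enough that random variation could noticeably distort its share in the hold-out set.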


<a id="32"></a>
#### 3.2 Training to Build and Validation

Next we need to split our training data into a portion used to fit the model (build set), and an initially untouched part we can use to calibrate our algorithm inputs (validation set). We'll make the validation set 1/8 of the training data; since 1/8 of 80% is 10%, this results in a 70-10-20 build-validate-test split.

In [None]:
X_build, X_val, y_build, y_val = train_test_split(X_train,y_train,test_size=0.125,random_state=rms)
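
The arithmetic behind the two-stage split can be checked on a toy array of 1,000 rows. This is only a sketch: the `_toy` names are introduced here for illustration and the seed 34 matches the value this notebook assigns to `rms`. Applied to the 8,196 training rows above, a 0.125 validation fraction should yield about 1,025 validation rows and 7,171 build rows, since scikit-learn rounds the held-out count up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows identified by their own value (illustrative only)
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.zeros(1000, dtype=int)

# Stage 1: hold out 20% for testing
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=34
)

# Stage 2: hold out 1/8 of the remaining 80% for validation (10% of the full data)
X_build_toy, X_val_toy, y_build_toy, y_val_toy = train_test_split(
    X_train_toy, y_train_toy, test_size=0.125, random_state=34
)

print(len(X_build_toy), len(X_val_toy), len(X_test_toy))

# The three parts should not share any rows
build_ids = set(X_build_toy.ravel())
val_ids = set(X_val_toy.ravel())
test_ids = set(X_test_toy.ravel())
print(build_ids.isdisjoint(val_ids), build_ids.isdisjoint(test_ids), val_ids.isdisjoint(test_ids))
```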

We can visualize our data splitting strategy as follows:

![data](https://github.com/ksuaray/IRML---Regression-and-Classification/blob/MLCert-Sketches/MLCert%20Sketches%202/4DataSets_MLReady0.png?raw=true)

With this completed, we shift our attention to data quality.

Armed with a complete data set on our selected cohort, we're a step closer to predictive modeling. That being said, there is still a gap between having a complete dataset, and having data prepared for analysis. Let's take some steps to get us ready for that goal.

In [None]:
X_train.to_csv('../private/Output for 2.5 & 2.6/X_train.csv', index=False)
y_train.to_csv('../private/Output for 2.5 & 2.6/y_train.csv', index=False)

X_build_c.to_csv('../private/Output for 2.5 & 2.6/X_build_c.csv', index=False)
y_build_c.to_csv('../private/Output for 2.5 & 2.6/y_build_c.csv', index=False)

X_val_c.to_csv('../private/Output for 2.5 & 2.6/X_val_c.csv', index=False)
y_val_c.to_csv('../private/Output for 2.5 & 2.6/y_val_c.csv', index=False)

X_test_c.to_csv('../private/Output for 2.5 & 2.6/X_test_c.csv', index=False)
y_test_c.to_csv('../private/Output for 2.5 & 2.6/y_test_c.csv', index=False)

Finally, it is important to attend to a consideration that affects a large number of cells in this notebook. We execute quite a few commands that insert randomness into the process, which would result in different answers every time we (and you) run this code. We can ensure *reproducibility* by setting a global seed for this notebook; these cells must be run before any cell above that references `rms`:

In [None]:
rms = 34

In [None]:
import numpy as np

rng = np.random.RandomState(rms)
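
As a quick illustration of what the seed buys us, the sketch below splits a toy array twice with the same integer `random_state` and confirms that the held-out rows are identical. The toy array and the `test_a`, `test_b` names are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rms = 34  # same seed value as above
X_toy = np.arange(50).reshape(-1, 1)

# Two separate calls with the same integer seed
_, test_a = train_test_split(X_toy, test_size=0.2, random_state=rms)
_, test_b = train_test_split(X_toy, test_size=0.2, random_state=rms)

# Same seed, same held-out rows
same_split = np.array_equal(test_a, test_b)
print(same_split)
```

Note that passing an integer reseeds the generator on every call, whereas passing a shared `RandomState` instance such as `rng` advances its state from one call to the next, so repeated calls with `rng` will generally produce different splits.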