# Code Summary - Data Wrangling

## 1. Data Import

### Imports and Display Settings

- `pandas` is imported for data handling.
- `pd.set_option('display.max_columns', None)` makes every column visible when a DataFrame is displayed.
- `warnings.filterwarnings('ignore')` suppresses all warning messages for the rest of the session.


In [None]:
import pandas as pd
import warnings

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
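
A global `'ignore'` filter also hides warnings that might matter later. As a hedged alternative, the sketch below silences a single warning category inside a `with` block only; `noisy_step` is a hypothetical stand-in for any call that emits a warning, not part of the original notebook.

```python
import warnings


def noisy_step():
    # Hypothetical function standing in for any call that emits a warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Silence only DeprecationWarning, and only inside this with-block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    result = noisy_step()

# In a fresh context with an "always" filter, the same call is recorded again.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```

Whether the narrower filter is worth the extra lines depends on how noisy the libraries in use are; for a short teaching notebook the global filter may be perfectly adequate.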

### Mounting Google Drive

- Connects Google Drive to Colab using `drive.mount()` to access files stored in Drive.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Loading the Dataset

- Loads the CSV file into a DataFrame and displays the first 5 rows for preview.


In [None]:
file_path = '/content/drive/MyDrive/projects/Applied-Data-Analytics-For-Higher-Education-Course-2/data/student_academics_data.csv'

df = pd.read_csv(file_path)
display(df.head())

Unnamed: 0,SID,COHORT,RACE_ETHNICITY,GENDER,FIRST_GEN_STATUS,FAMILY_INCOME,HS_GPA,HS_MATH_GPA,HS_ENGL_GPA,COLLEGE,...,UNITS_COMPLETED_1,UNITS_COMPLETED_2,DFW_UNITS_1,DFW_UNITS_2,GPA_1,GPA_2,GPA_3,SEM_1_STATUS,SEM_2_STATUS,SEM_3_STATUS
0,UHDOP5522,Fall 2020,Asian,Female,Continuing Generation,,3.72,3.2,3.4,Visual & Performing Arts,...,15.0,15.0,0.0,0.0,4.0,3.785714,4.0,E,E,E
1,UHE842CU6,Fall 2021,Black or African American,Female,Continuing Generation,,3.189,2.6,3.75,Visual & Performing Arts,...,12.0,12.0,3.0,4.0,3.0,2.5,1.5,E,E,E
2,UHJFT1JAB,Fall 2018,Asian,Female,Continuing Generation,,3.625,3.4,3.5,Visual & Performing Arts,...,15.0,16.0,0.0,0.0,3.8,3.6,3.6,E,E,E
3,UHKF05TAF,Fall 2018,Hispanic,Female,First Generation,,3.606,3.0,3.375,Letters & Humanities,...,7.0,3.0,9.0,9.0,1.5625,1.0,2.5,E,E,E
4,UHKKQ8UY5,Fall 2021,Hispanic,Male,Continuing Generation,50K<,3.536,2.5,2.625,Letters & Humanities,...,13.0,13.0,0.0,0.0,3.538462,3.769231,3.4,E,E,E


<a id="2"></a>
## 2 Data Wrangling

- Displays the number of records in each cohort using `value_counts()`.


In [None]:
df['COHORT'].value_counts()

COHORT,count
Fall 2022,5363
Fall 2019,5170
Fall 2018,4954
Fall 2020,4910
Fall 2021,4866


- Groups the data by cohort and third-semester status, counts the occurrences, and resets the index with a new column `COUNTS`.


In [None]:
df[['COHORT', 'SEM_3_STATUS']].groupby(['COHORT', 'SEM_3_STATUS']).size().reset_index(name='COUNTS')

Unnamed: 0,COHORT,SEM_3_STATUS,COUNTS
0,Fall 2018,E,4307
1,Fall 2018,N,647
2,Fall 2019,E,4583
3,Fall 2019,N,587
4,Fall 2020,E,4239
5,Fall 2020,N,671
6,Fall 2021,E,4133
7,Fall 2021,N,733
8,Fall 2022,E,4540
9,Fall 2022,N,823
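
The counts above are easier to compare across cohorts as shares. For Fall 2018, for instance, 4307 / 4954 ≈ 0.869 of students have status `E`. A minimal sketch of getting such shares directly, using `pd.crosstab` on made-up toy data rather than the course dataset:

```python
import pandas as pd

# Toy data (made up for illustration; not the course dataset).
toy = pd.DataFrame({
    "COHORT": ["Fall 2018"] * 4 + ["Fall 2019"] * 5,
    "SEM_3_STATUS": ["E", "E", "E", "N", "E", "E", "E", "E", "N"],
})

# Row-normalised cross-tabulation: each cohort's row sums to 1.
rates = pd.crosstab(toy["COHORT"], toy["SEM_3_STATUS"], normalize="index")
print(rates)
```

This is a swapped-in convenience, not the approach used in the cell above; the `groupby(...).size()` result could equally be divided by the per-cohort totals.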


# Drop unnecessary columns outside of analysis scope

- Drops `SEM_1_STATUS` and `SEM_2_STATUS` since the focus is on predicting third-semester outcomes.


In [None]:
df.drop(['SEM_1_STATUS', 'SEM_2_STATUS'], axis=1, inplace=True)

# Check for the number of missing values in each column

- Counts the missing values in each column with `df.isnull().sum()` and displays the result.


In [None]:
display(df.isnull().sum())

Unnamed: 0,0
SID,0
COHORT,0
RACE_ETHNICITY,0
GENDER,0
FIRST_GEN_STATUS,0
FAMILY_INCOME,20967
HS_GPA,124
HS_MATH_GPA,359
HS_ENGL_GPA,359
COLLEGE,0


<a id="4"></a>
# Addressing Missingness

- Drops columns where more than 50% of values are missing, such as `FAMILY_INCOME` (20,967 missing values), and displays the updated DataFrame.




In [None]:
missing_values_count = df.isnull().sum()
total_rows = len(df)
columns_to_drop = missing_values_count[missing_values_count / total_rows > 0.5].index.tolist()
df.drop(columns=columns_to_drop, inplace=True)
display(f"Number of remaining columns: {df.shape[1]}")
display(df.head())

'Number of remaining columns: 19'

Unnamed: 0,SID,COHORT,RACE_ETHNICITY,GENDER,FIRST_GEN_STATUS,HS_GPA,HS_MATH_GPA,HS_ENGL_GPA,COLLEGE,UNITS_ATTEMPTED_1,UNITS_ATTEMPTED_2,UNITS_COMPLETED_1,UNITS_COMPLETED_2,DFW_UNITS_1,DFW_UNITS_2,GPA_1,GPA_2,GPA_3,SEM_3_STATUS
0,UHDOP5522,Fall 2020,Asian,Female,Continuing Generation,3.72,3.2,3.4,Visual & Performing Arts,15.0,14.0,15.0,15.0,0.0,0.0,4.0,3.785714,4.0,E
1,UHE842CU6,Fall 2021,Black or African American,Female,Continuing Generation,3.189,2.6,3.75,Visual & Performing Arts,12.0,12.0,12.0,12.0,3.0,4.0,3.0,2.5,1.5,E
2,UHJFT1JAB,Fall 2018,Asian,Female,Continuing Generation,3.625,3.4,3.5,Visual & Performing Arts,15.0,15.0,15.0,16.0,0.0,0.0,3.8,3.6,3.6,E
3,UHKF05TAF,Fall 2018,Hispanic,Female,First Generation,3.606,3.0,3.375,Letters & Humanities,16.0,9.0,7.0,3.0,9.0,9.0,1.5625,1.0,2.5,E
4,UHKKQ8UY5,Fall 2021,Hispanic,Male,Continuing Generation,3.536,2.5,2.625,Letters & Humanities,13.0,13.0,13.0,13.0,0.0,0.0,3.538462,3.769231,3.4,E
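
The same threshold rule can be written with `isnull().mean()`, which returns the missing fraction per column directly. Assuming the cohort counts shown earlier cover every row (25,263 in total), `FAMILY_INCOME` would be about 83% missing, well above the 50% cut-off. The sketch below uses a made-up toy frame; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up; column names borrowed from the dataset).
toy = pd.DataFrame({
    "SID": ["a", "b", "c", "d"],
    "FAMILY_INCOME": [np.nan, np.nan, np.nan, "50K<"],  # 75% missing
    "HS_GPA": [3.7, np.nan, 3.6, 3.5],                   # 25% missing
})

threshold = 0.5  # same cut-off as the cell above

# mean() of a boolean mask is the fraction of True values, i.e. the missing share.
missing_fraction = toy.isnull().mean()
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

# Non-inplace drop, so the original frame stays available for comparison.
trimmed = toy.drop(columns=to_drop)
print(to_drop, list(trimmed.columns))
```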



## Rare Classes in Features

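- Displays the count of each category in the `RACE_ETHNICITY` column to identify rare classes.
- From this cell onward the code operates on a DataFrame named `retention` rather than `df`; the cell that creates it is not shown in this summary.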


In [None]:
#Distribution of classes in RACE_ETHNICITY
pd.Series(retention['RACE_ETHNICITY']).value_counts()

RACE_ETHNICITY
Hispanic/Latino                              8407
Asian                                        3510
White                                        2030
Two or More Races                             735
Black or African American                     608
Visa Non-U.S.                                 480
Unknown                                       206
Native Hawaiian or Other Pacific Islander      34
American Indian or Alaska Native               15
Name: count, dtype: int64

- Creates a condition to identify rows where `RACE_ETHNICITY` is 'American Indian or Alaska Native', 'Native Hawaiian or Other Pacific Islander', or 'Unknown', then locates their row indices.

- Replaces those values with 'Other' to consolidate rare classes, and updates the distribution using `value_counts()`.


In [None]:
#Consolidating the three smallest classes into one 'Other' class
condition = (retention['RACE_ETHNICITY'] == 'American Indian or Alaska Native') | \
            (retention['RACE_ETHNICITY'] == 'Native Hawaiian or Other Pacific Islander') | \
            (retention['RACE_ETHNICITY'] == 'Unknown')

# Get the original indices where the condition is true using loc
indices_true = retention.loc[condition].index

# Update 'RACE_ETHNICITY' to 'Other' for rows with true condition
retention.loc[indices_true, 'RACE_ETHNICITY'] = 'Other'

pd.Series(retention['RACE_ETHNICITY']).value_counts()

RACE_ETHNICITY
Hispanic/Latino              8407
Asian                        3510
White                        2030
Two or More Races             735
Black or African American     608
Visa Non-U.S.                 480
Other                         255
Name: count, dtype: int64
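
The cell above picks the classes to merge by name. A possible generalisation, sketched here under the assumption that "rare" can be defined by a count cut-off, is a small helper that folds every category below `min_count` into one label. `collapse_rare` and `min_count` are hypothetical names, not part of the original notebook. With the counts shown above, any cut-off from 207 to 480 would select the same three classes.

```python
import pandas as pd


def collapse_rare(series: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # where() keeps values where the condition is True and replaces the rest.
    return series.where(~series.isin(rare), other_label)


# Toy data: "C" (2 rows) and "D" (1 row) fall below the cut-off of 3.
toy = pd.Series(["A"] * 5 + ["B"] * 4 + ["C"] * 2 + ["D"])
collapsed = collapse_rare(toy, min_count=3)
print(collapsed.value_counts())
```

A count-based rule would not, on its own, distinguish a substantive category from a placeholder such as 'Unknown', so choosing by name as the notebook does remains a reasonable option.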

- Displays the value counts of `FIRST_GEN_STATUS` to check for rare categories, confirming all are sufficiently represented.


In [None]:
#Distribution of classes in FIRST_GEN_STATUS
pd.Series(retention['FIRST_GEN_STATUS']).value_counts()

FIRST_GEN_STATUS
Continuing Generation    9926
First Generation         4653
Unknown                  1446
Name: count, dtype: int64

- Displays the distribution of values in the `GENDER` column to identify rare categories, revealing a small number of 'Nonbinary' entries.


In [None]:
#Distribution of classes in GENDER
pd.Series(retention['GENDER']).value_counts()

GENDER
Female       9778
Male         6192
Nonbinary      55
Name: count, dtype: int64

- Filters out rows where `GENDER` is 'Nonbinary' and updates the distribution using `value_counts()`.


In [None]:
retention = retention[(retention['GENDER']!='Nonbinary')]

In [None]:
pd.Series(retention['GENDER']).value_counts()

GENDER
Female    9778
Male      6192
Name: count, dtype: int64

# Non-Affirmative Features

- Uses `enumerate()` to print column names with their index positions for easier reference when selecting columns to drop.


In [None]:
for i,j in enumerate(retention.columns):
    print(i,j)

0 SID
1 COHORT
2 RACE_ETHNICITY
3 GENDER
4 FIRST_GEN_STATUS
5 HS_GPA
6 HS_MATH_GPA
7 HS_ENGL_GPA
8 COLLEGE
9 UNITS_ATTEMPTED_1
10 UNITS_ATTEMPTED_2
11 UNITS_ATTEMPTED_3
12 UNITS_ATTEMPTED_4
13 UNITS_COMPLETED_1
14 UNITS_COMPLETED_2
15 UNITS_COMPLETED_3
16 UNITS_COMPLETED_4
17 DFW_UNITS_1
18 DFW_UNITS_2
19 DFW_UNITS_3
20 DFW_UNITS_4
21 GPA_1
22 GPA_2
23 GPA_3
24 GPA_4
25 CUM_GPA_1
26 CUM_GPA_2
27 CUM_GPA_3
28 CUM_GPA_4
29 SEM_1_STATUS
30 SEM_2_STATUS
31 SEM_3_STATUS
32 SEM_4_STATUS
33 SEM_5_STATUS
34 SEM_6_STATUS
35 SEM_7_STATUS
36 SEM_8_STATUS


- Creates a list of column indices to drop, covering non-informative, collinear, or out-of-scope features.
- Makes a copy of the DataFrame, removes the selected columns, resets the index, and previews the updated data.


In [None]:
ret_columns_to_drop = [5,11,12] + [15,16] + list(range(17,21)) + list(range(23,29)) + list(range(31,37))

retention_copy = retention.copy()

retention2 = retention_copy.drop(retention.columns[ret_columns_to_drop], axis=1)

retention2.reset_index(inplace=True,drop=True)

retention2.head()

Unnamed: 0,SID,COHORT,RACE_ETHNICITY,GENDER,FIRST_GEN_STATUS,HS_MATH_GPA,HS_ENGL_GPA,COLLEGE,UNITS_ATTEMPTED_1,UNITS_ATTEMPTED_2,UNITS_COMPLETED_1,UNITS_COMPLETED_2,GPA_1,GPA_2,SEM_1_STATUS,SEM_2_STATUS
0,JHPSY555D,Fall 2023,Hispanic/Latino,Female,Unknown,3.97,4.09,Business,12.0,,6.0,,1.666667,,NR,NR
1,9KC4NM2YV,Fall 2023,Hispanic/Latino,Male,First Generation,3.67,3.77,Arts,15.0,12.0,6.0,0.0,2.0,0.0,C,NR
2,33M8O2J01,Fall 2023,Hispanic/Latino,Male,First Generation,2.78,3.19,University Programs,13.0,,10.0,,2.0,,NR,NR
3,AMX4WP4A0,Fall 2023,Hispanic/Latino,Female,Continuing Generation,4.02,4.15,Science,13.0,15.0,13.0,11.0,2.615385,2.266667,C,C
4,R32ET2VTA,Fall 2023,Hispanic/Latino,Male,Continuing Generation,3.08,3.58,University Programs,6.0,6.0,3.0,3.0,1.0,1.5,C,NR
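
Positional indices break silently if the column order ever changes. As a hedged alternative, the sketch below selects columns by name instead. The name rules are my reading of the index list above (`HS_GPA`, all `DFW_UNITS_*` and `CUM_GPA_*`, and terms 3 to 8 of the per-term columns), and the toy frame contains only a handful of the real column names with made-up values.

```python
import pandas as pd

# Toy frame with a few of the column names listed above (values are made up).
toy = pd.DataFrame({
    "SID": ["x1", "x2"],
    "HS_GPA": [3.5, 3.1],
    "UNITS_ATTEMPTED_1": [15.0, 12.0],
    "UNITS_ATTEMPTED_3": [14.0, 9.0],
    "DFW_UNITS_1": [0.0, 3.0],
    "GPA_1": [3.8, 2.9],
    "CUM_GPA_1": [3.8, 2.9],
    "SEM_2_STATUS": ["C", "NR"],
    "SEM_5_STATUS": ["NR", "NR"],
})

# Whole column families, selected by prefix.
by_prefix = [c for c in toy.columns if c.startswith(("DFW_UNITS_", "CUM_GPA_"))]

# Per-term columns for terms 3 to 8 (anchored so CUM_GPA_* is not matched here).
later_terms = toy.filter(
    regex=r"^(UNITS_ATTEMPTED|UNITS_COMPLETED|GPA|SEM)_[3-8](_STATUS)?$"
).columns.tolist()

names_to_drop = ["HS_GPA"] + by_prefix + later_terms
slim = toy.drop(columns=names_to_drop)
print(list(slim.columns))
```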


- Splits the data into a training set (Fall 2021 & 2022) and a prediction set (Fall 2023).
- Drops `SEM_2_STATUS` from the prediction set since it's not observed for Fall 2023 students.


In [None]:
training = retention2[retention2["COHORT"].isin(["Fall 2021", "Fall 2022"])]
predict = retention2[retention2["COHORT"].isin(["Fall 2023"])]

# SEM_2_STATUS is the outcome to be predicted, so it is dropped from the prediction set
predict = predict.drop(columns=['SEM_2_STATUS'])

- Prints the number of rows and columns in the `training` DataFrame.


In [None]:
print(f'training set no. of rows {training.shape[0]}\n')
print(f'training set no. of columns {training.shape[1]}')

training set no. of rows 10245

training set no. of columns 16


- (Optional) Saves the `training` and `predict` DataFrames as CSV files for future use.


In [None]:
#training.to_csv('/Workspace/ira-ml-cert/data/training.csv', index=False)
#predict.to_csv('/Workspace/ira-ml-cert/data/predict.csv', index=False)

<a id="3"></a>
## 3 Data Splitting


<a id="31"></a>
#### 3.1 Full Data to Training and Testing

- Imports `train_test_split` from scikit-learn to split data into training and testing sets.


In [None]:
#Class for data splitting
from sklearn.model_selection import train_test_split

- Creates the feature matrix `X` by dropping identifier and target-related columns from the training set.


In [None]:
#Creating the feature matrix
X = training.drop(['SID','COHORT','SEM_2_STATUS'],axis=1)

- Creates the target variable `y` by encoding students with `SEM_2_STATUS` as 'NR' as 1 (not retained), and others as 0 (retained).


In [None]:
# Binary indicator for the NR class: 1 = not retained, 0 = retained
y = training['SEM_2_STATUS'].apply(lambda x: 1 if x == 'NR' else 0)
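
The same 0/1 target can be built without a Python-level lambda by comparing the whole column at once. This is a minimal sketch on a made-up toy Series; note that in both versions a missing status compares unequal to 'NR' and is therefore coded 0.

```python
import pandas as pd

# Toy status column (made up for illustration).
status = pd.Series(["C", "NR", "NR", "C", "C"], name="SEM_2_STATUS")

# Element-wise version, as in the cell above.
y_apply = status.apply(lambda x: 1 if x == "NR" else 0)

# Vectorised version: boolean comparison, then cast to integers.
y_vectorised = (status == "NR").astype(int)

# The mean of a 0/1 target is the share of the positive (NR) class.
print(y_vectorised.tolist(), y_vectorised.mean())
```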

- Splits the data into training and test sets using an 80–20 ratio with `train_test_split()`, ensuring reproducibility using `random_state`.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=rms)

# random_state makes the split reproducible: every time this cell runs, the same observations are allocated to the test set.
# rms is the notebook's global seed (set in the `rms = 34` cell).

**Figure 2:** An example of an 80%-20% train-test split on a dataframe with 20 observations. Randomly sample 20% * 20 = 4 values to hold out for model testing: Observations 2, 6, 13, and 19.

![ih](https://github.com/ksuaray/IRML---Regression-and-Classification/blob/MLCert-Sketches/MLCert%20Sketches%202/80-20-Xy.png?raw=true)

From our original 10,245 observations, 20% \\(\times\\) 10,245 \\(=\\) 2,049 will be reserved for model testing. To prevent *data leakage*, they will not be part of our data exploration or model fitting whatsoever; we don't want to peek at the test before exam day, right?

In [None]:
print(X_train.shape,X_test.shape)

(8196, 13) (2049, 13)
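
To see what `random_state` buys us, the sketch below repeats an 80-20 split on 20 toy observations (the same size as in Figure 2) with the same seed and checks that the held-out rows are identical. The arrays are made up, and the seed value 34 is taken from the notebook's `rms` setting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy observations, matching the size used in Figure 2.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20) % 2

seed = 34  # same value as the notebook's global seed, rms

# Two independent calls with the same seed.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.20, random_state=seed)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.20, random_state=seed)

print(X_test_a.ravel(), X_test_b.ravel())
```

Both calls hold out the same four observations; changing the seed would generally select a different four.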


<a id="32"></a>
#### 3.2 Training to Build and Validation

Next we need to split our training data into a portion used to fit the model (build set), and an initially untouched part we can use to calibrate our algorithm inputs (validation set). We'll make the validation set 1/8 of the training data (0.125 \\(\times\\) 80% = 10% of all observations), resulting in a 70-10-20 build-validate-test split.

In [None]:
X_build, X_val, y_build, y_val = train_test_split(X_train,y_train,test_size=0.125,random_state=rms)
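
The 70-10-20 arithmetic can be checked on a toy array: holding out 20% and then 12.5% of the remainder leaves 0.875 × 0.80 = 0.70 for building and 0.125 × 0.80 = 0.10 for validation. The sketch below uses 1,000 made-up rows so the shares come out exact; on the 8,196 training rows above, the validation set would be roughly 1,025 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 1000  # toy size chosen so every share is a whole number of rows
X = np.arange(n).reshape(-1, 1)
y = np.zeros(n, dtype=int)

# First split: 80% training, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=34
)

# Second split: 1/8 of the training rows become the validation set.
X_build, X_val, y_build, y_val = train_test_split(
    X_train, y_train, test_size=0.125, random_state=34
)

shares = {
    name: len(part) / n
    for name, part in [("build", X_build), ("validation", X_val), ("test", X_test)]
}
print(shares)
```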

We can visualize our data splitting strategy as follows:

![data](https://github.com/ksuaray/IRML---Regression-and-Classification/blob/MLCert-Sketches/MLCert%20Sketches%202/4DataSets_MLReady0.png?raw=true)

With this completed, we shift our attention to data quality.

Armed with a complete dataset on our selected cohort, we're a step closer to predictive modeling. That said, there is still a gap between having a complete dataset and having data prepared for analysis. Let's take some steps to close that gap.

In [None]:
X_train.to_csv('../private/Output for 2.5 & 2.6/X_train.csv', index=False)
y_train.to_csv('../private/Output for 2.5 & 2.6/y_train.csv', index=False)

X_build_c.to_csv('../private/Output for 2.5 & 2.6/X_build_c.csv', index=False)
y_build_c.to_csv('../private/Output for 2.5 & 2.6/y_build_c.csv', index=False)

X_val_c.to_csv('../private/Output for 2.5 & 2.6/X_val_c.csv', index=False)
y_val_c.to_csv('../private/Output for 2.5 & 2.6/y_val_c.csv', index=False)

X_test_c.to_csv('../private/Output for 2.5 & 2.6/X_test_c.csv', index=False)
y_test_c.to_csv('../private/Output for 2.5 & 2.6/y_test_c.csv', index=False)

Before we actually import the data into this notebook, it is important that we attend to a consideration that will affect a large number of cells. We'll be executing quite a few commands that insert randomness into the process, which will result in different answers every time we (and you) run this code. We can ensure *reproducibility* by setting a global seed for this notebook:

In [None]:
rms = 34

In [None]:
import numpy as np
rng = np.random.RandomState(rms)
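
As a quick illustration of what the seed guarantees, two generators created from the same seed produce the same sequence of draws. This is a minimal sketch; the `randint` range and sample size are arbitrary choices for the demonstration.

```python
import numpy as np

rms = 34  # the notebook's global seed

# Two generators built from the same seed.
rng_a = np.random.RandomState(rms)
rng_b = np.random.RandomState(rms)

# Each generator yields the same sequence when called in the same order.
draws_a = rng_a.randint(0, 100, size=5)
draws_b = rng_b.randint(0, 100, size=5)

print(draws_a, draws_b)
```

One design point worth keeping in mind: passing the integer `rms` as `random_state` gives the same result on every call, whereas passing a shared `RandomState` instance such as `rng` advances its internal state, so successive calls draw different values while the notebook as a whole stays reproducible when run top to bottom.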