
# Grades Project: Data Cleaning

This notebook continues the work begun in the previous notebook for this project, where all .csv files associated with this dataset were merged.

The goal of this notebook is to prepare the data for machine learning preprocessing. Along the way, we want to understand what the data contains and decide what we want to predict from it.

This notebook will:
- Create new columns as necessary and delete columns as necessary (preliminary feature engineering)
- Identify possible target variables
- Delete any rows from the dataframes that do not have values for the target variables
- Delete any duplicated rows
- Inspect missing values (missing values will NOT be imputed; this will be left for machine learning preprocessing)
- Identify outliers

## Overview


The dataset (https://www.kaggle.com/datasets/Madgrades/uw-madison-courses) is provided as multiple .csv files. To use it for a machine learning model, the files had to be merged into a single .csv file, which was done in the previous notebook.

## .csv files and columns

Included in dataset as separate .csv files:
- schedules.csv: each row is a unique potential schedule
    - (schedule) uuid: unique identifier of schedules
    - start_time: start of class, in minutes
    - end_time: end of class, in minutes (drop -- high collinearity with start_time)
    - mon: boolean, if class meets on monday
    - tues: "
    - wed: "
    - thurs: "
    - fri: "
    - sat: "
    - sun: "
      - sat and sun have very few values; combine to get more observations
- subjects.csv: each row is a unique subject
    - (subject) code: 3-digit unique identifier of subjects
    - (subject) name: name of subject
    - (subject) abbreviation: abbreviation (e.g. ENGL for English)
- teachings.csv: each row is a unique instructor
    - instructor_id: numeric unique identifier of instructor
    - section_uuid: section taught by instructor
- subject_memberships.csv: each row is a course offering (course offered in certain term; does not encompass all sections in that term)
    - subject_code: subject code associated with course offering
    - course_offering_uuid: unique identifier of course offerings
- sections.csv: each row is a section (specific instance of course at certain time in certain place in certain term)
    - (section) uuid: unique identifier of section (alphanumeric)
      - could have multiple section uuids for cross-listed sections
    - course_offering_uuid: unique identifier of course offering (course offered in certain term but encompasses all sections in that term) 
    - section_type: 3-letter identifier (e.g., LEC for lecture)
    - (section) number: 1- to 3-digit number of section (e.g., 301 for section 301)
    - room_uuid: unique identifier of room and building where section is held (including online and off-campus designations)
    - schedule_uuid: unique identifier of schedule for section
- rooms.csv: each row is a specific place
    - (room) uuid: unique identifier of building and room (including off-campus and online)
    - facility_code: unique identifier of building
    - room_code: number of room within a building
- instructors.csv (will not use -- is essentially a duplicate of teachings; can use to look up instructors' names): each row is an instructor
    - (instructor) id: unique identifier of instructor
    - (instructor) name: instructor's name
- grade_distributions.csv: each row is a section's grades
    - course_offering_uuid: identifies the course offering (certain course in certain term, but not broken down into sections)
    - section_number: number of section for grades
    - a_count: number of As
    - ab_count: number of ABs (can be combined with As as needed)
    - b_count: number of Bs
    - bc_count: number of BCs (can be combined with Bs as needed)
    - c_count: number of Cs
    - d_count: number of Ds
    - f_count: number of Fs
    - s_count: satisfactory
    - u_count: unsatisfactory
    - cr_count: credit
    - n_count: no credit
    - p_count
    - i_count: incomplete
    - nw_count: no work
    - nr_count
- course_offerings.csv: each row is a course offering (course offered in certain term)
    - (course offering) uuid: unique identifier of course offering
    - course_uuid: course that the course offering belongs to
    - term_code: academic term when course offering was held
    - (course offering) name: name associated with course offering (can be different from course name)
- courses.csv: each row is a course (abstract; not associated with specific course offering)
    - (course) uuid: uniquely identifies the course
    - (course) name: name of course
    - (course) number: number in course catalog (e.g. 101 for ENGL 101)
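
The note above about sparse `sat` and `sun` values could be acted on roughly as follows. This is a minimal sketch on a toy frame, not the merged data; the column name `weekend` is an assumption chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the merged data; only the two weekend flags matter here.
toy = pd.DataFrame({
    'sat': [False, True, False, False],
    'sun': [False, False, True, False],
})

# 'weekend' (hypothetical name) is True when the section meets on either day.
toy['weekend'] = toy['sat'] | toy['sun']

# The separate flags are then redundant and can be dropped.
toy = toy.drop(columns=['sat', 'sun'])
```

Combining the two flags with a logical OR keeps the information that a section meets on a weekend while giving the model more positive observations in one column.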

## Feature variables

This dataset is large and has many variables we can choose from as our feature variables.

Many of the variables have a large number of unique values (high cardinality), so we may have to drop them for our machine learning algorithm to be able to explain some of the variance in the targets.

We will also be combining some columns and dropping some columns for various reasons. 

## Target variables

We are interested in whether the feature variables can predict variation in grades. There are many grade types given (see source: https://guide.wisc.edu/undergraduate/#enrollmentandrecordstext), but to keep the problem manageable, we will narrow down the target grades.

We will create the following target variables and see whether machine learning algorithms can predict the variation in any of them.

1. Proportion of A grades given (of all grades given)
2. Proportion of F grades given (of all grades given)
3. Average letter grade given (letter grades: A, AB, B, BC, C, D, and F)
4. Number of grades given

# Preliminary Steps

In [125]:
# import libraries
import pandas as pd
import seaborn as sns
import numpy as np

In [64]:
# load data
path = 'Data/all_grades_data.csv'
df = pd.read_csv(path, low_memory = False, index_col = 0)

In [65]:
# inspect
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162710 entries, 0 to 162709
Data columns (total 43 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   section_uuid          162710 non-null  object 
 1   course_offering_uuid  162710 non-null  object 
 2   section_type          162710 non-null  object 
 3   section_number        162710 non-null  float64
 4   room_uuid             66778 non-null   object 
 5   schedule_uuid         162710 non-null  object 
 6   instructor_id         162682 non-null  float64
 7   facility_code         66778 non-null   object 
 8   room_code             64871 non-null   object 
 9   start_time            162710 non-null  float64
 10  end_time              162710 non-null  float64
 11  mon                   162710 non-null  bool   
 12  tues                  162710 non-null  bool   
 13  wed                   162710 non-null  bool   
 14  thurs                 162710 non-null  bool   
 15  

In [66]:
# inspect
df.sample(10)

Unnamed: 0,section_uuid,course_offering_uuid,section_type,section_number,room_uuid,schedule_uuid,instructor_id,facility_code,room_code,start_time,...,f_count,s_count,u_count,cr_count,n_count,p_count,i_count,nw_count,nr_count,other_count
19009,cbac306d-f905-3ccf-9f7d-5f9442a05539,d9a497db-6a2b-394b-ae5a-1291819c7de9,ind,60.0,,f2d66a4d-0c08-3b48-abf6-649fffd7ae90,2601315.0,,,-1.0,...,0,0,0,0,0,0,0,0,0,0
67790,84512989-316c-370c-8a33-cbcb3791af02,fa7e9a07-ec4d-3e31-b376-d8bb0a0b47b0,ind,16.0,,f2d66a4d-0c08-3b48-abf6-649fffd7ae90,2601570.0,,,-1.0,...,0,0,0,0,0,0,0,0,0,0
146834,423a07c3-e408-3393-a49a-a111afd3b09a,3412570c-4052-3744-ad0c-08c78215ec87,lab,1.0,53ee561a-3e94-310a-9d6d-201601a27306,2aaa8364-5590-3985-b39e-edaf597d0f8f,4955455.0,469.0,7451.0,990.0,...,0,0,0,0,0,0,0,0,0,0
53460,c8ec9106-9201-39bc-8b25-9d17fe26c258,717e506e-cff5-3c6e-8b03-d64a68ff0597,sem,8.0,,f2d66a4d-0c08-3b48-abf6-649fffd7ae90,2601560.0,,,-1.0,...,0,0,0,0,0,0,0,0,0,0
7473,c4bf5d98-1b37-3db8-ba2e-0f5d9e6cdbbd,45d4b00e-152d-3149-8cfe-32b07c6c7be8,lec,1.0,f11d9506-23cd-3052-b1fa-e7261a80add7,bcf762ed-780b-3bef-a285-c743f06cbe28,2601971.0,118.0,128.0,900.0,...,0,0,0,0,0,0,0,0,0,0
111218,84e04fef-e9aa-3ef0-bdbd-8d2d33911437,1d938124-07f8-3c8c-a499-ebcb7e22b530,lec,1.0,6c0b393c-8208-34a0-bb41-4b4ec8292357,3f78e049-8448-355b-b5a0-725b0f85f1d1,1016452.0,46.0,6102.0,480.0,...,2,0,0,0,0,0,0,0,0,0
112170,2d0c4512-003c-30f2-ac98-9ba2e4c68b21,47056cc9-dc99-3a60-9df5-1bea8b3a6ce8,lec,1.0,ce8bbeed-5bf1-3044-b7f1-3310d5d99e58,f273853e-fa6d-38ba-8707-63bb7d16d8cd,4113134.0,482.0,378.0,870.0,...,0,0,0,0,0,0,0,0,0,0
121255,c3d9fa16-5f13-37b1-b3ce-60b495b8774f,d48cd389-a77f-385b-a29a-47e2e16fff8a,lec,1.0,b3435a89-719a-3ef1-a73b-9f26dede8e56,22b99e69-ae34-3577-bf01-899774ba3047,164464.0,408.0,2534.0,595.0,...,2,1,0,0,0,0,0,0,0,0
162679,9931e0c5-af09-39b4-b84c-c4f0613342c9,ccf20fd7-a0f0-390e-82c8-3a268ce70881,lec,1.0,f6a4c14e-6c1c-38db-8966-031c10958892,40d52ac7-8397-3157-9fad-bf4f923ab555,4194092.0,476.0,245.0,480.0,...,0,0,0,0,0,0,0,0,0,0
378,1c29cc4f-bf9e-3964-8a12-dff96686ef17,78a11bde-275c-313a-9c65-ea93faacdeba,lec,31.0,b245416a-1cb2-3cd6-919c-4bde53be2155,b8f8227d-d5bb-35a8-b903-d3213ede8a00,5514889.0,545.0,4018.0,725.0,...,0,0,0,0,0,0,0,0,0,0


In [67]:
# inspect
df.columns

Index(['section_uuid', 'course_offering_uuid', 'section_type',
       'section_number', 'room_uuid', 'schedule_uuid', 'instructor_id',
       'facility_code', 'room_code', 'start_time', 'end_time', 'mon', 'tues',
       'wed', 'thurs', 'fri', 'sat', 'sun', 'subject_code', 'subject_name',
       'subject_abbreviation', 'course_uuid', 'term_code',
       'course_offering_name', 'course_name', 'course_number',
       'course_and_section', 'a_count', 'ab_count', 'b_count', 'bc_count',
       'c_count', 'd_count', 'f_count', 's_count', 'u_count', 'cr_count',
       'n_count', 'p_count', 'i_count', 'nw_count', 'nr_count', 'other_count'],
      dtype='object')

# Grades

Information from the University of Wisconsin's website about grades can be found here: https://guide.wisc.edu/undergraduate/#enrollmentandrecordstext

Grades:
- a, ab, b, bc, c, d, f - letter grades
  - ab - intermediate grade between a and b
  - bc - intermediate grade between b and c
- s - satisfactory, used in pass/fail courses
- u - unsatisfactory, used in pass/fail courses
- cr - credit, used in credit/no credit courses
- n - no credit, used in credit/no credit courses
- p - progress: temporary grade used for courses extending beyond one term
- i - incomplete: temporary grade used when work is not completed during a term
- nw - no work: for students who enroll in a course and then never attend
- nr - no report: a grade was not submitted by the instructor
- other - any one of several other grading codes

In [78]:
# for every row
for i in range(len(df)):

    # read the count of each grade type for this row
    num_a = df.at[i, 'a_count']
    num_ab = df.at[i, 'ab_count']
    num_b = df.at[i, 'b_count']
    num_bc = df.at[i, 'bc_count']
    num_c = df.at[i, 'c_count']
    num_d = df.at[i, 'd_count']
    num_f = df.at[i, 'f_count']
    num_s = df.at[i, 's_count']
    num_u = df.at[i, 'u_count']
    num_cr = df.at[i, 'cr_count']
    num_n = df.at[i, 'n_count']
    num_p = df.at[i, 'p_count']
    num_i = df.at[i, 'i_count']
    num_nw = df.at[i, 'nw_count']
    num_nr = df.at[i, 'nr_count']
    num_other = df.at[i, 'other_count']
    
    # calculate number of letter grades
    num_letter_grades = (num_a + num_ab + num_b + num_bc + num_c + num_d + 
                         num_f)
    
    # if there are letter grades
    if num_letter_grades != 0:
        
        # calculate the average grade and store
        avg_letter_grade = ((num_a * 4.0) + 
                            (num_ab * 3.5) + 
                            (num_b * 3.0) +
                            (num_bc * 2.5) + 
                            (num_c * 2.0) + 
                            (num_d * 1.0)) / num_letter_grades

        df.at[i, 'avg_letter_grade'] = avg_letter_grade

    # calculate number of all grades
    num_all_grades = (num_a + num_ab + num_b + num_bc + num_c + num_d + 
                      num_f + num_s + num_u + num_cr + num_n + num_p + 
                      num_i + num_nw + num_nr + num_other)
    
    # create column for num_all_grades to easily be able to delete courses 
    # where no grades were recorded
    df.at[i, 'num_all_grades'] = num_all_grades

    # if num_all_grades is not 0 (we want to exclude rows where there are no 
    # recorded grades)
    if num_all_grades != 0:

        # calculate and store proportions
        a_proportion = (num_a) / (num_all_grades)
        df.at[i, 'a_proportion'] = a_proportion
        
        ab_proportion = (num_ab) / (num_all_grades)
        df.at[i, 'ab_proportion'] = ab_proportion
        
        b_proportion = (num_b) / (num_all_grades)
        df.at[i, 'b_proportion'] = b_proportion
        
        bc_proportion = (num_bc) / (num_all_grades)
        df.at[i, 'bc_proportion'] = bc_proportion
        
        c_proportion = (num_c) / (num_all_grades)
        df.at[i, 'c_proportion'] = c_proportion
        
        d_proportion = (num_d) / (num_all_grades)
        df.at[i, 'd_proportion'] = d_proportion
        
        f_proportion = (num_f) / (num_all_grades)
        df.at[i, 'f_proportion'] = f_proportion
        
        s_proportion = (num_s) / (num_all_grades)
        df.at[i, 's_proportion'] = s_proportion
        
        u_proportion = (num_u) / (num_all_grades)
        df.at[i, 'u_proportion'] = u_proportion
        
        cr_proportion = (num_cr) / (num_all_grades)
        df.at[i, 'cr_proportion'] = cr_proportion
        
        n_proportion = (num_n) / (num_all_grades)
        df.at[i, 'n_proportion'] = n_proportion
        
        p_proportion = (num_p) / (num_all_grades)
        df.at[i, 'p_proportion'] = p_proportion
        
        i_proportion = (num_i) / (num_all_grades)
        df.at[i, 'i_proportion'] = i_proportion
        
        nw_proportion = (num_nw) / (num_all_grades)
        df.at[i, 'nw_proportion'] = nw_proportion
        
        nr_proportion = (num_nr) / (num_all_grades)
        df.at[i, 'nr_proportion'] = nr_proportion
        
        other_proportion = (num_other) / (num_all_grades)
        df.at[i, 'other_proportion'] = other_proportion
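
The row-by-row loop above can be slow on a frame of this size. As a rough alternative, the same quantities could be computed with column-wise pandas operations. This is a sketch on a toy frame, assuming the same `*_count` column names; the names `toy`, `points`, and `denominator` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with the same grade-count columns as the merged data.
grades = ['a', 'ab', 'b', 'bc', 'c', 'd', 'f', 's', 'u', 'cr', 'n',
          'p', 'i', 'nw', 'nr', 'other']
count_cols = [f'{g}_count' for g in grades]
toy = pd.DataFrame(0, index=range(2), columns=count_cols)
toy.loc[1, ['a_count', 'b_count', 'f_count', 's_count']] = [2, 1, 1, 1]

# Grade points on the 4.0 scale used in the loop above (F counts as 0).
points = pd.Series({'a_count': 4.0, 'ab_count': 3.5, 'b_count': 3.0,
                    'bc_count': 2.5, 'c_count': 2.0, 'd_count': 1.0,
                    'f_count': 0.0})

num_letter = toy[points.index].sum(axis=1)
toy['num_all_grades'] = toy[count_cols].sum(axis=1)

# Replacing 0 with NaN before dividing leaves NaN where no grades exist,
# which matches the loop's behavior of skipping those rows.
weighted = (toy[points.index] * points).sum(axis=1)
toy['avg_letter_grade'] = weighted / num_letter.replace(0, np.nan)

denominator = toy['num_all_grades'].replace(0, np.nan)
for g in grades:
    toy[f'{g}_proportion'] = toy[f'{g}_count'] / denominator
```

One difference from the loop: here `num_all_grades` stays an integer column, whereas the loop stores it as a float.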


In [79]:
# check
df[['a_proportion', 'ab_proportion', 'b_proportion', 'bc_proportion',
    'c_proportion', 'd_proportion', 'f_proportion', 's_proportion',
    'u_proportion', 'cr_proportion', 'n_proportion', 'p_proportion',
    'i_proportion', 'nw_proportion', 'nr_proportion', 'other_proportion',
    'num_all_grades', 'avg_letter_grade']]

Unnamed: 0,a_proportion,ab_proportion,b_proportion,bc_proportion,c_proportion,d_proportion,f_proportion,s_proportion,u_proportion,cr_proportion,n_proportion,p_proportion,i_proportion,nw_proportion,nr_proportion,other_proportion,num_all_grades,avg_letter_grade
0,,,,,,,,,,,,,,,,,0.0,
1,,,,,,,,,,,,,,,,,0.0,
2,,,,,,,,,,,,,,,,,0.0,
3,,,,,,,,,,,,,,,,,0.0,
4,,,,,,,,,,,,,,,,,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162705,0.22500,0.237500,0.312500,0.1125,0.0375,0.0625,0.012500,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,80.0,3.087500
162706,,,,,,,,,,,,,,,,,0.0,
162707,,,,,,,,,,,,,,,,,0.0,
162708,0.52381,0.238095,0.047619,0.0000,0.0000,0.0000,0.047619,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21.0,3.583333


## Drop courses with no grades

In [80]:
# check number of rows
df.shape

(162710, 61)

In [81]:
# check if any rows have null for 'num_all_grades'
df['num_all_grades'].isna().sum()

0

In [82]:
# check number of courses with zero grades
len(df[df['num_all_grades'] == 0])

94183

In [83]:
# drop courses with zero grades
df = df[df['num_all_grades'] != 0]

# check
df.shape

(68527, 61)

# 'term_code'

The current 'term_code' column gives a 4-digit code indicating the term the course was offered in. 

'term_code':
- first digit always 1 (21st century)
- second two digits academic year (from 07 to 18; year that the academic year ends in summer)
- fourth digit term (2 = fall, 4 = spring, 6 = summer)

Source: https://kb.wisc.edu/registrar/117776

In [84]:
# inspect
df['term_code'].value_counts(dropna = False)
# data includes years '07' to '18' and only fall (2) and spring (4) terms

1072.0    3301
1102.0    3241
1082.0    3240
1092.0    3235
1152.0    3215
1122.0    3184
1074.0    3182
1132.0    3146
1112.0    3134
1142.0    3125
1084.0    3116
1162.0    3103
1172.0    3102
1104.0    3071
1134.0    3048
1182.0    3045
1094.0    3045
1114.0    3035
1144.0    3018
1154.0    3002
1164.0    2972
1174.0    2967
Name: term_code, dtype: int64

We will split 'term_code' into two columns: one for year (numeric, 7-18) and one for term (nominal, fall or spring).

In [85]:
# reset index to iterate through
df.reset_index(drop = True, inplace = True)

In [89]:
# iterate through all rows in df
for row in range(len(df)):

    # make sure string is 4 or more characters long
    if len(df.loc[row, 'term_code'].astype(str)) >= 4:

        # assign 'year' as the middle two characters in the string
        df.loc[row, 'year'] = df.loc[row, 'term_code'].astype(str)[1:3]

        # if string ends in 2
        if df.loc[row, 'term_code'].astype(str)[3] == '2':

            # assign 'term' as 'fall'
            df.loc[row, 'term'] = 'fall'
  
        # else
        else:

            # assign 'term' as 'spring'
            df.loc[row, 'term'] = 'spring'
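
The split above could also be written without a loop, using pandas string methods. This is a minimal sketch on a toy frame; it assumes every `term_code` is present and is a 4-digit number stored as a float, as in the output above.

```python
import pandas as pd

# Toy term codes stored as floats, as in the merged data.
toy = pd.DataFrame({'term_code': [1072.0, 1074.0, 1182.0]})

# Convert to a 4-character string such as '1072'.
code = toy['term_code'].astype(int).astype(str)

# Characters 2-3 are the academic year; the 4th character is the term.
toy['year'] = code.str[1:3]
toy['term'] = code.str[3].map({'2': 'fall', '4': 'spring', '6': 'summer'})
```

Unlike the loop, which labels every code not ending in 2 as spring, the explicit mapping leaves any unexpected final digit as a missing value, which makes bad codes easier to spot.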

In [90]:
# check
print(df['year'].value_counts(dropna = False))
print()
print(df['term'].value_counts(dropna = False))

07    6483
08    6356
10    6312
09    6280
15    6217
13    6194
11    6169
14    6143
16    6075
17    6069
12    3184
18    3045
Name: year, dtype: int64

fall      38071
spring    30456
Name: term, dtype: int64


In [91]:
# drop 'term_code' as it is now redundant
df = df.drop(columns = 'term_code')

# check
print('term_code' in df.columns)

False


# Rooms and buildings

Every room_uuid maps to a facility_code, which identifies a building, and a room_code, which identifies a room within that building. We will drop room_code because it is likely a high-cardinality variable (there are a great many rooms on campus), and room_uuid because it identifies the same building-and-room combination. We will keep the building (facility_code).

In [92]:
df.drop(columns = ['room_uuid', 'room_code'], inplace = True)

# check
print('room_uuid' in df.columns)
print('room_code' in df.columns)

False
False


# Schedules

'start_time' and 'end_time' will probably be highly collinear, because many classes follow a fixed schedule, so we can drop 'end_time'.

First, though, we will calculate 'class_length', because we are interested in whether the length of a class explains some of the variance in the grades earned. 'class_length' is the length of a single class meeting in minutes (if a class meets from 9:00am to 9:50am on MWF, 'class_length' will be 50 minutes).

In [93]:
# 'class_length'
df['class_length'] = df['end_time'] - df['start_time']

In [95]:
# drop 'end_time'
df.drop(columns = 'end_time', inplace = True)

# check
'end_time' in df.columns

False

In [96]:
# reset index for further cleaning
df.reset_index(inplace = True, drop = True)

In [97]:
# drop 'schedule_uuid'; it correlates with the other schedule variables
df.drop(columns = 'schedule_uuid', inplace = True)

# check
'schedule_uuid' in df.columns

False

# Subjects

Drop 'subject_abbreviation' and 'subject_code', as these duplicate the information in the 'subject_name' column.

In [98]:
df.drop(columns = ['subject_code', 'subject_abbreviation'], inplace = True)

# check
print('subject_code' in df.columns)
print('subject_abbreviation' in df.columns)

False
False


# Course number

The 'course_number' column includes numbers in the following ranges:
- below 100: below college-level
- 100-299: elementary (undergraduate students only)
- 300-499: intermediate (grad & undergrad)
- 500-699: advanced (grad & undergrad)
- 700+: graduate students only

Source: https://grad.wisc.edu/documents/course-numbering-system/

We will split up the courses into these ranges to reduce the cardinality of the variable.

We will drop the below college-level and graduate student only courses and retain just courses that are elementary, intermediate, or advanced, and that undergraduate students may take. (But we will wait until after merging with big_df to do so, so that we delete the other information for these courses as well.)

In [99]:
for i in range(len(df)):
    if df.at[i, 'course_number'] < 100:
        df.at[i, 'course_difficulty'] = 'below level'
    elif df.at[i, 'course_number'] < 300:
        df.at[i, 'course_difficulty'] = 'elementary'
    elif df.at[i, 'course_number'] < 500:
        df.at[i, 'course_difficulty'] = 'intermediate'
    elif df.at[i, 'course_number'] < 700:
        df.at[i, 'course_difficulty'] = 'advanced'
    elif df.at[i, 'course_number'] >= 700:
        df.at[i, 'course_difficulty'] = 'grad level'
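
The same ranges could be assigned in one step with `pd.cut`. This is a sketch on a toy frame, using the range boundaries listed above; it assumes course numbers are non-negative.

```python
import pandas as pd

toy = pd.DataFrame({'course_number': [50, 100, 299, 300, 499, 500, 699, 700]})

# Left-closed bins: [0, 100), [100, 300), [300, 500), [500, 700), [700, inf)
bins = [0, 100, 300, 500, 700, float('inf')]
labels = ['below level', 'elementary', 'intermediate', 'advanced', 'grad level']
toy['course_difficulty'] = pd.cut(toy['course_number'], bins=bins,
                                  labels=labels, right=False)
```

Note that `pd.cut` returns a categorical column rather than plain strings, so later steps that expect `object` dtype may need `.astype(str)`.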

In [100]:
# check
df['course_difficulty'].value_counts()

elementary      19932
intermediate    18671
grad level      17756
advanced        11603
below level       565
Name: course_difficulty, dtype: int64

In [101]:
# we are only interested in 'elementary', 'intermediate', and 'advanced' courses
df.drop(df[(df['course_difficulty'] == 'below level') | 
           (df['course_difficulty'] == 'grad level')].index, 
        inplace = True)

In [102]:
# check
df['course_difficulty'].value_counts()

elementary      19932
intermediate    18671
advanced        11603
Name: course_difficulty, dtype: int64

In [103]:
# drop course_number column
df.drop(columns = 'course_number', inplace = True)

# check
'course_number' in df.columns

False

# course_uuid

In [104]:
# can drop 'course_uuid' column because it should match up with 'course_name' column
df.drop(columns = 'course_uuid', inplace = True)

# check
'course_uuid' in df.columns

False

# 'instructor_id'

The 'instructor_id' column has too many unique values to be useful for the machine learning algorithm. We will keep the instructors who taught the most sections and group all the others together.

To get about 100 unique values in the column, we will relabel every instructor who taught fewer sections than the top 100 instructors as 'other'.

In [106]:
df['instructor_id'].nunique()

7184

In [141]:
# get numbers of courses taught for top 100 instructors
df['instructor_id'].value_counts().values[:100]

array([137, 132, 106,  85,  81,  79,  77,  70,  69,  67,  67,  67,  66,
        66,  64,  63,  63,  62,  61,  61,  61,  61,  60,  60,  59,  59,
        58,  58,  58,  57,  57,  56,  55,  55,  54,  54,  54,  53,  53,
        53,  52,  52,  52,  51,  51,  51,  50,  50,  50,  50,  49,  48,
        48,  48,  48,  48,  47,  47,  46,  46,  46,  46,  46,  45,  45,
        45,  45,  45,  44,  44,  44,  44,  44,  43,  43,  43,  43,  43,
        43,  43,  43,  43,  42,  42,  42,  42,  42,  42,  41,  41,  41,
        41,  41,  41,  40,  40,  40,  39,  39,  39], dtype=int64)

The top 100 instructors each taught 39 or more sections, so 39 is the cutoff we will use. Because of ties at 39, slightly more than 100 instructors meet it.

In [144]:
# create empty list to store instructor_ids of instructors who taught more
# than 38 sections
instructor_100_list = []

# iterate over values and counts in 'instructor_id'
for value, count in df['instructor_id'].value_counts().items():
  
    # if count is more than 38 (i.e. instructor taught more than 38 sections)
    if count > 38:
    
        # append to list
        instructor_100_list.append(value)

# check
len(instructor_100_list)

106

In [145]:
# change value in 'instructor_id' column to 'other' if instructor_id is not
# on instructor_100_list 

df.loc[~df['instructor_id'].isin(instructor_100_list), 
       'instructor_id'] = 'other'
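
The list-building and relabeling steps above follow a pattern that recurs in this notebook, so they could be wrapped in a small helper. This is a sketch only; `bin_rare` and its parameters are hypothetical names, not part of pandas.

```python
import pandas as pd

def bin_rare(series, min_count, other_label='other'):
    """Replace values that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # where() keeps values where the condition holds and swaps in the label
    # elsewhere; casting to object first lets numbers and strings coexist.
    return series.astype(object).where(series.isin(keep), other_label)

toy = pd.Series([1.0, 1.0, 1.0, 2.0, 2.0, 3.0])
binned = bin_rare(toy, min_count=2)
```

With the data above, the equivalent call would be something like `bin_rare(df['instructor_id'], min_count=39)`. As in the original cell, missing ids also end up labeled 'other', because they are never in the keep list.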

In [146]:
# check
df['instructor_id'].sample(10)

16743    157057.0
3359        other
30671       other
37419       other
14909       other
61442       other
30572       other
31343       other
66918       other
45628    566960.0
Name: instructor_id, dtype: object

In [155]:
# how many unique values now?
df['instructor_id'].nunique()

107

# 'course_name'

As with 'instructor_id', there are too many unique values in 'course_name'.

In [148]:
df['course_name'].nunique()

3816

In [149]:
# get numbers of rows for top 100 course_names
df['course_name'].value_counts().values[:100]

array([997, 810, 652, 491, 464, 463, 459, 418, 334, 330, 302, 300, 291,
       252, 248, 242, 241, 231, 221, 218, 195, 185, 185, 181, 174, 155,
       153, 151, 149, 145, 137, 136, 129, 124, 123, 119, 119, 119, 119,
       114, 112, 111, 105, 104, 101,  99,  95,  95,  94,  93,  93,  93,
        93,  92,  92,  92,  90,  90,  88,  87,  86,  86,  86,  85,  84,
        84,  83,  83,  83,  83,  81,  81,  78,  77,  76,  75,  75,  75,
        74,  73,  73,  73,  72,  72,  71,  70,  70,  69,  68,  68,  67,
        67,  67,  65,  64,  64,  63,  63,  62,  62], dtype=int64)

We will relabel every course_name with fewer than 62 occurrences as 'other'.

In [151]:
# create empty list to store course_names
course_name_100_list = []

# iterate over values and counts in 'course_name'
for value, count in df['course_name'].value_counts().items():
  
    # if count is more than 61 (i.e., course has been taught more than 61 times)
    if count > 61:
    
        # append to list
        course_name_100_list.append(value)

# this is the number of courses that have been taught more than 61 times
print(len(course_name_100_list))

103


In [152]:
# change 'course_name' column such that if 'course_name' is on 
# 'course_name_100_list', the course name stays, and if not, it gets changed
# to 'other'
df.loc[~df['course_name'].isin(course_name_100_list), 'course_name'] = 'other'

In [153]:
# check
df['course_name'].sample(10)

41936               technical presentations
16732               technical communication
48257                                 other
18964                  freshman composition
25151    introductory managerial accounting
17867                                 other
43631                       weight training
62611                                 other
59102                                 other
40872                                 other
Name: course_name, dtype: object

In [154]:
# how many unique values now in 'course_name'?
df['course_name'].nunique()

104

# 'course_offering_name' and 'course_offering_uuid'

In [156]:
# course_offering_name is correlated with course_name, so we will drop 
# it for now
# course_offering_uuid will not predict grades, so we will drop it for now
df.drop(columns = ['course_offering_name', 'course_offering_uuid'], 
        inplace = True)

In [157]:
# check
print('course_offering_name' in df.columns)
print('course_offering_uuid' in df.columns)

False
False


# 'section_uuid' and 'section_number'

In [159]:
# drop section_uuid and section_number, because they will not be predictive 
# of grades
df.drop(columns = ['section_uuid', 'section_number'], inplace = True)

In [160]:
# check
print('section_uuid' in df.columns)
print('section_number' in df.columns)

False
False


# Final checks

In [161]:
# check unique values and value_counts() in all columns
for col in df.columns:
  print(f"{col}:")
  print(f"unique values: {df[col].nunique()}")
  print(f"value counts: \n{df[col].value_counts(dropna = False)}")
  print()

section_type:
unique values: 6
value counts: 
lec    39649
lab     4905
sem     2607
ind     2182
fld      599
dis      264
Name: section_type, dtype: int64

instructor_id:
unique values: 107
value counts: 
other        44605
2601912.0      137
566960.0       132
496397.0       106
2601706.0       85
             ...  
631912.0        39
2600197.0       39
3659559.0       39
2600012.0       39
315329.0        39
Name: instructor_id, Length: 107, dtype: int64

facility_code:
unique values: 120
value counts: 
0482     7458
0469     4609
NaN      3976
0140     3449
0408     2538
         ... 
0092d       1
0039        1
1400k       1
1400g       1
0137        1
Name: facility_code, Length: 121, dtype: int64

start_time:
unique values: 118
value counts: 
 660.0    7660
 800.0    4448
 595.0    4372
-1.0      4091
 870.0    3633
          ... 
 965.0       1
 760.0       1
 440.0       1
 490.0       1
 550.0       1
Name: start_time, Length: 118, dtype: int64

mon:
unique values: 2
value c

In [162]:
# check length of df
len(df)

50206

In [163]:
# 'course_and_section' uniquely identifies each row, so it can be dropped
df.drop(columns = 'course_and_section', inplace = True)

# check
'course_and_section' in df.columns

False

In [165]:
# reset index
df.reset_index(drop = True, inplace = True)

# check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50206 entries, 0 to 50205
Data columns (total 51 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   section_type       50206 non-null  object 
 1   instructor_id      50206 non-null  object 
 2   facility_code      46230 non-null  object 
 3   start_time         50206 non-null  float64
 4   mon                50206 non-null  bool   
 5   tues               50206 non-null  bool   
 6   wed                50206 non-null  bool   
 7   thurs              50206 non-null  bool   
 8   fri                50206 non-null  bool   
 9   sat                50206 non-null  bool   
 10  sun                50206 non-null  bool   
 11  subject_name       50206 non-null  object 
 12  course_name        50206 non-null  object 
 13  a_count            50206 non-null  int64  
 14  ab_count           50206 non-null  int64  
 15  b_count            50206 non-null  int64  
 16  bc_count           502

In [166]:
# inspect
df.sample(10)

Unnamed: 0,section_type,instructor_id,facility_code,start_time,mon,tues,wed,thurs,fri,sat,...,p_proportion,i_proportion,nw_proportion,nr_proportion,other_proportion,avg_letter_grade,year,term,class_length,course_difficulty
1238,lec,other,off campus,-1.0,False,False,False,False,False,False,...,0.0,0.0,0.0,0.0,0.0,3.933333,14,spring,0.0,elementary
45974,lec,other,0482,800.0,True,True,True,True,False,False,...,0.0,0.0,0.0,0.0,0.0,3.5,7,fall,50.0,elementary
42994,lec,other,0482,725.0,True,True,True,True,False,False,...,0.0,0.0,0.0,0.0,0.083333,3.136364,12,fall,50.0,elementary
31671,lec,980056.0,0481,660.0,True,False,True,False,False,False,...,0.0,0.0,0.0,0.0,0.0,3.315789,13,fall,50.0,elementary
19908,lec,other,0140,865.0,False,False,True,False,False,False,...,0.0,0.0,0.0,0.0,0.0,3.875,7,spring,155.0,intermediate
15943,lec,other,0482,800.0,True,False,True,False,True,False,...,0.0,0.0,0.0,0.0,0.0,3.652174,9,fall,50.0,intermediate
9708,fld,other,1480,480.0,True,False,False,False,False,False,...,0.0,0.0,0.0,0.0,0.0,3.925676,14,spring,50.0,intermediate
17075,sem,other,0046,960.0,False,True,False,True,False,False,...,0.0,0.042553,0.021277,0.0,0.0,3.284091,17,spring,75.0,intermediate
4106,lec,other,0407,870.0,False,True,False,True,False,False,...,0.0,0.0,0.0,0.0,0.0,3.604167,13,spring,75.0,intermediate
29868,lec,other,0046,570.0,False,True,False,True,False,False,...,0.0,0.0,0.0,0.0,0.0,3.388889,7,spring,75.0,intermediate


# Final data dictionary

- section_type: such as lecture, discussion, field, etc.
- instructor_id: instructor
- facility_code: building
- start_time: start time of class in minutes (-1: no start time assigned)
- mon, tues, wed, thurs, fri, sat, sun: if class meets on that day
- subject_name: subject
- year: academic year (calendar year of spring term)
- term: fall or spring
- course_name: same for same course across terms
- course_difficulty: based on course numbering system
- class_length: number of minutes between start and end time
- num_all_grades: number of all grades, letter or other, given in section
- count and proportion columns each for following grades: a, ab, b, bc, c, d, f, s, u, cr, n, p, i, nw, nr, other
- avg_letter_grade: average grade on 4.0 scale counting all letter grades awarded
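
As a rough illustration of how the count, proportion, and 'avg_letter_grade' columns relate, here is a minimal sketch on a made-up two-row frame. The grade-point mapping (A = 4.0, AB = 3.5, B = 3.0, BC = 2.5, C = 2.0, D = 1.0, F = 0.0) is an assumption based on the standard UW-Madison scale, and the toy counts are invented. The notebook's actual computation is not shown in this section, so treat this as one plausible reading of the dictionary entries above.

```python
import pandas as pd

# Assumed grade points (standard UW-Madison scale); not taken from this notebook.
GRADE_POINTS = {'a': 4.0, 'ab': 3.5, 'b': 3.0, 'bc': 2.5, 'c': 2.0, 'd': 1.0, 'f': 0.0}

# Toy data: two hypothetical sections with made-up counts.
toy = pd.DataFrame({
    'a_count': [2, 0], 'ab_count': [0, 0], 'b_count': [2, 0], 'bc_count': [0, 0],
    'c_count': [0, 0], 'd_count': [0, 0], 'f_count': [0, 0], 's_count': [1, 3],
})

count_cols = [c for c in toy.columns if c.endswith('_count')]
toy['num_all_grades'] = toy[count_cols].sum(axis=1)

# Proportion of each grade among all grades given in the section.
toy['a_proportion'] = toy['a_count'] / toy['num_all_grades']

# Average over letter grades only; sections with no letter grades yield NaN.
points = sum(toy[f'{g}_count'] * p for g, p in GRADE_POINTS.items())
letters = sum(toy[f'{g}_count'] for g in GRADE_POINTS)
toy['avg_letter_grade'] = points / letters

print(toy[['num_all_grades', 'a_proportion', 'avg_letter_grade']])
```

Under this reading, a section that awarded only non-letter grades (S, CR, and so on) has no defined average, which would be consistent with 'avg_letter_grade' having fewer non-null values than the other numeric columns.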

# Check for (and delete) duplicates

In [167]:
# check for duplicates
df.duplicated().sum()

231

In [168]:
# drop
df.drop_duplicates(inplace = True)

In [169]:
# check again
df.duplicated().sum()

0
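
Before dropping, it can help to look at which rows are flagged. The sketch below uses a small made-up frame (not the project data) to show that `duplicated()` by default flags only the second and later copies of a row, which is what the count above measures, while `duplicated(keep=False)` flags every copy so the full set can be inspected.

```python
import pandas as pd

# Toy frame with one fully duplicated row (hypothetical values).
toy = pd.DataFrame({
    'section_type': ['lec', 'lec', 'sem'],
    'start_time': [800.0, 800.0, 960.0],
})

# Default keep='first': only later copies are flagged.
n_extra_copies = toy.duplicated().sum()

# keep=False flags every row that has an identical twin, for inspection.
all_copies = toy[toy.duplicated(keep=False)]

# Non-inplace equivalent of the drop above.
deduped = toy.drop_duplicates()

print(n_extra_copies, len(all_copies), len(deduped))
```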

# Identify and address missing values

In [83]:
df.isna().sum()

section_type            0
instructor_id           0
facility_code        3491
start_time              0
mon                     0
tues                    0
wed                     0
thurs                   0
fri                     0
subject_name            0
course_name             0
a_proportion            0
f_proportion            0
avg_grade               0
year                    0
term                    0
class_length            0
total_time              0
weekend                 0
course_difficulty       0
dtype: int64

In the output above, the only column with missing values is 'facility_code', which gives the building where the section was held. Missing values here could mean that the section was never assigned a classroom (i.e., that the instructor and students arranged their own meeting place).

In machine learning preprocessing, we will impute these missing values with the constant 'missing'. 
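
A minimal sketch of what that constant imputation could look like, using pandas only and a made-up stand-in for the 'facility_code' column. In a scikit-learn pipeline the usual counterpart would be `SimpleImputer(strategy='constant', fill_value='missing')`; whether the later preprocessing notebook uses scikit-learn is an assumption here.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'facility_code' (values are made up).
toy = pd.DataFrame({'facility_code': ['0482', np.nan, 'off campus', None]})

# Replace missing building codes with the literal category 'missing'.
imputed = toy['facility_code'].fillna('missing')

print(imputed.tolist())
```

Treating 'missing' as its own category keeps those rows in the data and lets a model learn whether an unassigned room carries any signal.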

# Identify and correct inconsistencies in categorical values

In [84]:
df.dtypes

section_type          object
instructor_id         object
facility_code         object
start_time           float64
mon                   object
tues                  object
wed                   object
thurs                 object
fri                   object
subject_name          object
course_name           object
a_proportion         float64
f_proportion         float64
avg_grade            float64
year                  object
term                  object
class_length         float64
total_time             int64
weekend               object
course_difficulty     object
dtype: object

In [170]:
# check values in 'object' columns
object_columns = df.select_dtypes(include='object').columns

for column in object_columns:
    print(column)
    print(df[column].unique())
    print()

section_type
['lec' 'ind' 'sem' 'lab' 'fld' 'dis']

instructor_id
['other' 2601642.0 309711.0 811223.0 2602070.0 886751.0 3659559.0 685141.0
 2600075.0 1112569.0 566960.0 3234517.0 377240.0 631912.0 3013497.0
 663146.0 464468.0 623858.0 965150.0 2601467.0 342827.0 2601912.0 984470.0
 964473.0 809212.0 777651.0 960897.0 4530799.0 3128595.0 344599.0 260106.0
 4232086.0 2601573.0 3357721.0 2600197.0 900201.0 1005245.0 412406.0
 4232087.0 922322.0 3382514.0 710039.0 783847.0 2600759.0 2601502.0
 496397.0 3793122.0 2601181.0 3673656.0 2600559.0 3615604.0 2600403.0
 423731.0 718608.0 819732.0 157057.0 2601527.0 650044.0 130429.0 1005574.0
 446645.0 573481.0 133526.0 692771.0 2601066.0 636841.0 3076440.0
 5476239.0 685944.0 4841799.0 1600563.0 3029337.0 980056.0 2600282.0
 315329.0 3105692.0 2600807.0 818841.0 806537.0 4124270.0 2601630.0
 3041045.0 2601318.0 710873.0 2600598.0 4539921.0 470031.0 3383097.0
 4083699.0 593987.0 565790.0 2602098.0 3216300.0 2601320.0 302280.0
 159304.0 2601242.0
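
The 'instructor_id' output above mixes the string 'other' with float IDs in a single object column. One possible way to make the labels consistent is to cast every value to a string. This is a sketch on toy values; the helper name `to_label` is hypothetical, and it assumes the column has no missing values, as the `isna()` output earlier reports.

```python
import pandas as pd

# Toy column mixing a string label with float IDs, as in the output above.
toy = pd.Series(['other', 2601642.0, 309711.0, 'other'], dtype='object')

def to_label(value):
    # Floats such as 2601642.0 become '2601642'; strings pass through unchanged.
    if isinstance(value, float):
        return str(int(value))
    return str(value)

clean = toy.map(to_label)

print(clean.unique())
```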

In [171]:
# check values in numeric columns
df.describe(include = 'number')

Unnamed: 0,start_time,a_count,ab_count,b_count,bc_count,c_count,d_count,f_count,s_count,u_count,...,u_proportion,cr_proportion,n_proportion,p_proportion,i_proportion,nw_proportion,nr_proportion,other_proportion,avg_letter_grade,class_length
count,49975.0,49975.0,49975.0,49975.0,49975.0,49975.0,49975.0,49975.0,49975.0,49975.0,...,49975.0,49975.0,49975.0,49975.0,49975.0,49975.0,49975.0,49975.0,49286.0,49975.0
mean,672.910255,14.49979,8.032436,7.410965,2.744292,2.473037,0.695488,0.414787,0.54075,0.006343,...,0.00021,0.002403,0.005138,0.000371,0.004748,0.001013,0.001491,0.000642,3.517272,74.013607
std,245.748156,20.346604,13.11443,14.649307,6.937,7.614905,2.425147,1.373987,3.910638,0.087999,...,0.003887,0.035476,0.063172,0.01825,0.027221,0.011966,0.021864,0.010205,0.375103,50.271292
min,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,570.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.266667,50.0
50%,660.0,9.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5625,75.0
75%,865.0,16.0,9.0,7.0,2.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.833333,75.0
max,1260.0,694.0,234.0,219.0,125.0,157.0,44.0,30.0,165.0,6.0,...,0.142857,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,600.0


These values look reasonable.

- 'start_time':
  - Values of -1 mean that there is no assigned start time for the course.
  - The max start time is 1260 minutes, which is 9pm.

- The proportion columns are between 0 and 1.

- 'avg_letter_grade' is between 0 and 4.

- 'class_length' and 'total_time':
  - The min values are 0, which represent courses that were not assigned a meeting time/schedule.
  - The max values are interesting. A 'class_length' value of 600 minutes (10 hours) is unusual. College courses rarely (if ever) meet for 10 hours per week, much less 10 hours at a time. 
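
The checks in the list above can be sketched in code. The example converts minutes after midnight to a clock time (so 1260 becomes 21:00, i.e. 9 pm) and confirms that the values stay within their expected ranges. The frame is a made-up stand-in for `df`, and `minutes_to_clock` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def minutes_to_clock(minutes):
    # Convert minutes after midnight to 'HH:MM'; -1 is the 'no start time' code.
    if minutes < 0:
        return None
    hours, mins = divmod(int(minutes), 60)
    return f'{hours:02d}:{mins:02d}'

# Toy rows with made-up values in the documented ranges.
toy = pd.DataFrame({
    'start_time': [-1.0, 660.0, 1260.0],
    'a_proportion': [0.0, 0.5, 1.0],
    'avg_letter_grade': [3.5, None, 4.0],
})

print([minutes_to_clock(m) for m in toy['start_time']])

# Range checks; between() is False for NaN, so missing averages are excluded first.
proportion_ok = toy['a_proportion'].between(0, 1).all()
grade_ok = toy['avg_letter_grade'].dropna().between(0, 4).all()
print(proportion_ok, grade_ok)
```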

In [172]:
df[df['class_length'] == 600]

Unnamed: 0,section_type,instructor_id,facility_code,start_time,mon,tues,wed,thurs,fri,sat,...,p_proportion,i_proportion,nw_proportion,nr_proportion,other_proportion,avg_letter_grade,year,term,class_length,course_difficulty
8387,sem,other,,480.0,False,False,True,False,False,False,...,0.0,0.0,0.0,0.0,0.0,4.0,11,spring,600.0,advanced
9055,fld,685944.0,off campus,420.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,3.9375,10,fall,600.0,advanced
9056,fld,685944.0,off campus,420.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,3.892857,11,fall,600.0,advanced
9057,fld,685944.0,off campus,420.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,3.772727,11,fall,600.0,advanced
45792,fld,other,off campus,480.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,4.0,17,spring,600.0,advanced
45793,fld,other,off campus,480.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,4.0,16,spring,600.0,advanced
48081,fld,685944.0,off campus,420.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,3.964286,10,fall,600.0,advanced
48082,fld,685944.0,off campus,420.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,3.875,11,fall,600.0,advanced
48091,fld,685944.0,off campus,420.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,3.916667,10,fall,600.0,advanced
48092,fld,685944.0,off campus,420.0,True,True,True,True,True,False,...,0.0,0.0,0.0,0.0,0.0,3.958333,11,fall,600.0,advanced


It looks like all of the sections with a 'class_length' of 600 minutes are at the advanced level, and all but one are field classes. Based on their course names, I expect these are internship-like courses, where the student is expected to block off the entire workweek for the course and intern at an off-campus location for credit. This seems plausible, so I will keep these values.

The one course that isn't a field course is a seminar in international studies that meets for 600 minutes on a single day of the week. This seems very unusual, but without more information about it, I will keep it.
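
Rather than reading the rows by eye, the breakdown of the longest sections by type and level can be tallied directly. The sketch below uses a made-up frame with the same column names as `df`; the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for df (made-up rows), with the columns used in the filter above.
toy = pd.DataFrame({
    'section_type': ['fld', 'fld', 'sem', 'lec'],
    'course_difficulty': ['advanced', 'advanced', 'advanced', 'elementary'],
    'class_length': [600.0, 600.0, 600.0, 50.0],
})

longest = toy[toy['class_length'] == toy['class_length'].max()]

# Count the longest sections by type and difficulty level.
summary = longest.groupby(['section_type', 'course_difficulty']).size()
print(summary)
```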

# Export to csv

In [174]:
df.to_csv('Data/all_grades_data_cleaned.csv')
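
One detail worth keeping in mind when this file is read back: by default `to_csv` writes the index as an unnamed first column, which `read_csv` then loads as 'Unnamed: 0' unless `index_col=0` is passed. The sketch below shows both options on a toy frame written to a temporary directory. The file name is hypothetical, and whether the next notebook needs the original index is not stated here.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'section_type': ['lec', 'sem'], 'class_length': [50.0, 75.0]},
                   index=[1238, 17075])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_cleaned.csv')  # hypothetical file name

    # Default: the index is written as an unnamed first column.
    toy.to_csv(path)
    default_cols = pd.read_csv(path).columns.tolist()

    # index_col=0 restores that column as the index on read.
    restored = pd.read_csv(path, index_col=0)

    # index=False omits the index from the file entirely.
    toy.to_csv(path, index=False)
    no_index_cols = pd.read_csv(path).columns.tolist()

print(default_cols)
print(restored.index.tolist())
print(no_index_cols)
```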