# Data Cleaning

By Kenneth Burchfiel

Released under the MIT License

Our test_results database file already contains results for the fall and the spring. However, let's say that you've been asked to add a set of winter results to this dataset as well, then calculate a weighted average of fall, winter, and spring results for each student. 

If these results were in the same format as the fall and spring ones and had no missing data, this process would be very simple. Unfortunately, that's not the case with the fictional winter results that we'll be processing within this script. These results feature:

1. Different column names
2. Different value formats
3. Missing columns
4. Duplicate values
5. Missing results for certain students

And to make matters even more complex, these winter results are spread out over 24 different files (one for each school/test day pair).

It would be cumbersome and mind-numbing to modify each of these 24 datasets within Excel so that they could be combined with our pre-existing fall and spring data. However, the Python code shown below will make this data cleaning process much easier. And once this script is in place, if you happened to get next year's winter results in the same format* as this year's, you'd be able to get them cleaned up in no time.

\*You may find in your work, however, that the results are in yet another format the following year, followed by a different format the year after that. Data-related tasks are always made easier when inputs stay the same, but in the real world, you'll often need to rework datasets in order to make them compatible with pre-existing processes. 

In [1]:
import os 
import pandas as pd
import numpy as np
import sqlalchemy
pfn_db_engine = sqlalchemy.create_engine(
'sqlite:///'+'../data/network_database.db')

# Part 1: Data Cleaning

## Locating our data

We can use Python's os library to create a list of our 24 winter test files. These are stored in a folder whose name contains the testing period (e.g. winter), starting school year (2023), and ending school year (2024). We can use Python to retrieve the testing period and starting school year from this folder name so that they can then be added into our reformatted datasets.

In [2]:
result_folder_name = 'winter_2023_2024_test_results/'
# Adding a / to the end of this path here will prevent us
# from having to add a '/' before each filename when reading
# in results from this folder.

# Retrieving the testing period and year from the folder name:
# (We could also just hard-code these variables, but we'd then 
# need to remember to update those variables in future years;
# if we forgot, we'd end up with incorrect data.)

period = result_folder_name.split('_')[0].title() # title() converts
# 'winter' to 'Winter' so that its format will match the 'Fall'
# and 'Spring' values already found in the dataset.
starting_year = result_folder_name.split('_')[1]
print("Period:",period, "Starting year:",starting_year) # Ensuring that our
# split() operations were successful

result_file_list = os.listdir(result_folder_name)
result_file_list

Period: Winter Starting year: 2023


['CA Test Day 1 Results.csv',
 'CA Test Day 2 Results.csv',
 'CA Test Day 3 Results.csv',
 'CA Test Day 4 Results.csv',
 'CA Test Day 5 Results.csv',
 'CA Test Day 6 Results.csv',
 'DA Test Day 1 Results.csv',
 'DA Test Day 2 Results.csv',
 'DA Test Day 3 Results.csv',
 'DA Test Day 4 Results.csv',
 'DA Test Day 5 Results.csv',
 'DA Test Day 6 Results.csv',
 'HA Test Day 1 Results.csv',
 'HA Test Day 2 Results.csv',
 'HA Test Day 3 Results.csv',
 'HA Test Day 4 Results.csv',
 'HA Test Day 5 Results.csv',
 'HA Test Day 6 Results.csv',
 'SA Test Day 1 Results.csv',
 'SA Test Day 2 Results.csv',
 'SA Test Day 3 Results.csv',
 'SA Test Day 4 Results.csv',
 'SA Test Day 5 Results.csv',
 'SA Test Day 6 Results.csv']
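
The order of the entries returned by os.listdir() isn't guaranteed, and a results folder can pick up stray non-result files over time. As a minimal sketch (the temporary folder and the file names in it are made up for illustration), the file list could be limited to .csv files and sorted like so:

```python
import os
import tempfile

# Building a throwaway folder with hypothetical contents so that
# this example can run on its own:
with tempfile.TemporaryDirectory() as demo_folder:
    for demo_name in ['DA Test Day 1 Results.csv',
                      'CA Test Day 2 Results.csv',
                      'notes.txt']:
        with open(os.path.join(demo_folder, demo_name), 'w') as demo_file:
            demo_file.write('')

    # Keeping only .csv files and sorting them alphabetically:
    demo_file_list = sorted(
        name for name in os.listdir(demo_folder)
        if name.lower().endswith('.csv'))

print(demo_file_list)
```

Whether this extra filtering is worth adding depends on how tightly the contents of the results folder are controlled.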

We'll now build out code that converts one of these files into a format compatible with our pre-existing fall and spring results. Once this code is in place, we'll then convert it into a function that can be applied to all 24 files.

In [3]:
# Selecting our result file:
result_file = result_file_list[0]

# Retrieving school and test day data from this file's name:
school = result_file.split(' ')[0]
# The original results didn't have any test date data, but 
# we'll need this information in order to remove duplicate
# results from our dataset.
test_day = result_file.split(' ')[3]
print("School:",school, "Test Day:", test_day)

# Importing result_file's contents into a DataFrame:
df_result = pd.read_csv(result_folder_name + result_file)
df_result

School: CA Test Day: 1


Unnamed: 0,Score,Identification Code,Student's Grade
0,30%,ID:40-865,4th Grade
1,44%,ID:41-976,2nd Grade
2,61%,ID:42-194,7th Grade
3,45%,ID:41-150,3rd Grade
4,48%,ID:43-342,5th Grade
...,...,...,...
128,37%,ID:42-380,9th Grade
129,34%,ID:42-026,1st Grade
130,47%,ID:41-131,11th Grade
131,63%,ID:41-915,10th Grade


Let's compare this format with that found in our test_results table:

In [4]:
df_test_results = pd.read_sql("Select * from test_results", 
con = pfn_db_engine)
df_test_results

Unnamed: 0,Student_ID,School,Grade,Starting_Year,Period,Score
0,42026,CA,1,2023,Fall,47
1,43491,CA,1,2023,Fall,49
2,41637,CA,1,2023,Fall,57
3,40365,CA,1,2023,Fall,63
4,41516,CA,1,2023,Fall,51
...,...,...,...,...,...,...
7995,41060,SA,K,2023,Spring,58
7996,43942,SA,K,2023,Spring,50
7997,40479,SA,K,2023,Spring,60
7998,41160,SA,K,2023,Spring,52


Oh boy! Only one column ('Score') has the same name in both datasets; the scores, IDs, and grades are formatted differently; and we're missing the 'Starting_Year', 'Period', and 'School' columns. Fortunately, we can resolve all of these issues via Python, as shown below.
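
When tables have more columns than these, comparing them by eye gets error-prone. Here's a small sketch of how Python sets could be used to list the differences instead. The column names are copied from the two tables shown above; the variable names are just for illustration.

```python
# Column names copied from the two tables shown above:
winter_columns = {'Score', 'Identification Code', "Student's Grade"}
test_results_columns = {'Student_ID', 'School', 'Grade',
                        'Starting_Year', 'Period', 'Score'}

# Set operations reveal which names overlap and which don't:
shared_columns = winter_columns & test_results_columns
missing_from_winter = test_results_columns - winter_columns
only_in_winter = winter_columns - test_results_columns

print("Shared:", sorted(shared_columns))
print("Missing from winter file:", sorted(missing_from_winter))
print("Only in winter file:", sorted(only_in_winter))
```

Note that this comparison only looks at names: 'Identification Code' and 'Student_ID' hold the same information, but they'll still be reported as different columns until one of them is renamed.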

## Cleaning a single file

We'll start by renaming our 'Identification Code' and 'Student's Grade' columns so that they match the names found in our test_results table:

In [5]:
df_result.rename(columns = {'Identification Code':'Student_ID', 
                            "Student's Grade":'Grade'}, inplace = True)
df_result

Unnamed: 0,Score,Student_ID,Grade
0,30%,ID:40-865,4th Grade
1,44%,ID:41-976,2nd Grade
2,61%,ID:42-194,7th Grade
3,45%,ID:41-150,3rd Grade
4,48%,ID:43-342,5th Grade
...,...,...,...
128,37%,ID:42-380,9th Grade
129,34%,ID:42-026,1st Grade
130,47%,ID:41-131,11th Grade
131,63%,ID:41-915,10th Grade


### Reformatting column values

Next, we'll change the formats of these columns' values so that they match up with those found in our test_results table.

In [6]:
# Converting 'Score' values from strings with percentage symbols to integers:
df_result['Score'] = df_result['Score'].str.replace(
    '%', '').astype('int')
df_result.head(3)

Unnamed: 0,Score,Student_ID,Grade
0,30,ID:40-865,4th Grade
1,44,ID:41-976,2nd Grade
2,61,ID:42-194,7th Grade
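
The astype('int') call above will raise an error if any score can't be parsed as a number. If that's a possibility in your own data, one option (sketched below with made-up values) is to use pd.to_numeric() with errors='coerce', which turns unparseable entries into NaN values so that they can be reviewed separately.

```python
import pandas as pd

# Hypothetical scores, including one entry that isn't a valid percentage:
demo_scores = pd.Series(['30%', '44%', 'absent'])

# errors='coerce' converts unparseable entries to NaN rather than
# raising an error. (The resulting column will use a float type,
# since NaN can't be stored in a regular integer column.)
demo_scores_numeric = pd.to_numeric(
    demo_scores.str.replace('%', '', regex=False), errors='coerce')

print(demo_scores_numeric)
```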


We can convert the Grade column values from long names (e.g. '4th Grade', 'Kindergarten') to short ones (e.g. '4', 'K') by keeping only the first character within each row: 

In [7]:
df_result['Grade'] = df_result['Grade'].str[0]
df_result

Unnamed: 0,Score,Student_ID,Grade
0,30,ID:40-865,4
1,44,ID:41-976,2
2,61,ID:42-194,7
3,45,ID:41-150,3
4,48,ID:43-342,5
...,...,...,...
128,37,ID:42-380,9
129,34,ID:42-026,1
130,47,ID:41-131,1
131,63,ID:41-915,1
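
One caveat: keeping only the first character works for single-digit grades and for 'Kindergarten', but it truncates two-digit grades. (Rows 130 and 131 above, originally '11th Grade' and '10th Grade', both became '1'.) The following is a sketch of one possible alternative, not the approach used in the rest of this notebook: it extracts the leading digits from each value and falls back to the first character only when no digits are found. The sample values and variable names are for illustration only.

```python
import pandas as pd

demo_grades = pd.Series(
    ['4th Grade', '10th Grade', '11th Grade', 'Kindergarten'])

# Extracting the leading run of digits (if any) from each value:
leading_digits = demo_grades.str.extract(r'^(\d+)', expand=False)

# Rows without leading digits (e.g. 'Kindergarten') fall back to
# their first character:
demo_short_grades = leading_digits.fillna(demo_grades.str[0])

print(demo_short_grades.tolist())
```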


In order to convert the Student ID values in this table from strings to integers, we could try using index operators ([]) to access each group of numbers. However, in some datasets, the numbers are prefaced by 'Student ID' instead of 'ID'. Plus, in future years, the number of characters before and after each hyphen might change, which would further interfere with our efforts to use index operators. Therefore, we'll instead use a regex operation to replace all non-digit characters with empty strings, thus filtering each row within this column to include only ID values.

In [8]:
df_result['Student_ID'] = df_result['Student_ID'].str.replace(
    r'\D','', regex = True).astype('int')
# \D is a code for non-digit characters 
# (see https://docs.python.org/3/library/re.html),
# so the above line replaces all non-digit characters (like 'ID' and '-')
# with empty strings. (The r prefix makes this a raw string, which keeps
# Python from treating the backslash as the start of an escape sequence.)

# This line assumes that no digits exist within the non-ID characters in
# this column. If such digits did exist, we could instead use the following code
# to keep only student IDs within this column:

# df_result['Student_ID'] = (df_result['Student_ID'].str.split(
#     ':').str[1].str.split('-').str[0] 
#  + df_result['Student_ID'].str.split(
#      ':').str[1].str.split(
#      '-').str[1]).astype('int')

# However, this approach is much more cumbersome and would also fail to work
# if the colons and hyphens found in the column were replaced by some other
# format in the future. 




In [9]:
df_result

Unnamed: 0,Score,Student_ID,Grade
0,30,40865,4
1,44,41976,2
2,61,42194,7
3,45,41150,3
4,48,43342,5
...,...,...,...
128,37,42380,9
129,34,42026,1
130,47,41131,1
131,63,41915,1
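
To see what the `\D` pattern does outside of Pandas, here's a small sketch that uses Python's built-in re library on two sample strings. The second string uses the 'Student ID' prefix mentioned earlier. The pattern is written as a raw string (r'\D') so that Python doesn't try to interpret the backslash as an escape sequence.

```python
import re

# Sample ID strings; the 'Student ID' prefix is one of the alternate
# formats mentioned above.
demo_ids = ['ID:40-865', 'Student ID:41-976']

# r'\D' matches any single non-digit character, so replacing each
# match with '' leaves only the digits behind:
demo_clean_ids = [int(re.sub(r'\D', '', demo_id)) for demo_id in demo_ids]

print(demo_clean_ids)
```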


## Adding in values from file and folder names:

This dataset doesn't contain school, period, or year information; instead, that information was stored in its filename and the folder in which it is located. Therefore, we'll now fill in that data using the school, school year, and period variables that we created earlier:

In [10]:
df_result['School'] = school
df_result['Period'] = period
df_result['Starting_Year'] = starting_year

# Our original test_results table didn't have test date information,
# but we'll add in those values as well to assist with an upcoming
# duplicate removal task.

df_result['Test_Day'] = test_day

df_result

Unnamed: 0,Score,Student_ID,Grade,School,Period,Starting_Year,Test_Day
0,30,40865,4,CA,Winter,2023,1
1,44,41976,2,CA,Winter,2023,1
2,61,42194,7,CA,Winter,2023,1
3,45,41150,3,CA,Winter,2023,1
4,48,43342,5,CA,Winter,2023,1
...,...,...,...,...,...,...,...
128,37,42380,9,CA,Winter,2023,1
129,34,42026,1,CA,Winter,2023,1
130,47,41131,1,CA,Winter,2023,1
131,63,41915,1,CA,Winter,2023,1


We have now cleaned up our dataset! Its columns don't match the order of those found in `test_results`, but this won't necessarily cause any issues, as Pandas will automatically match these columns to those found in `test_results` if we try adding the two datasets together:

In [11]:
pd.concat([df_test_results, df_result]) # Note that our new Winter data,
# found at the bottom of this table, matches the column order found
# in the test_results table.

Unnamed: 0,Student_ID,School,Grade,Starting_Year,Period,Score,Test_Day
0,42026,CA,1,2023,Fall,47,
1,43491,CA,1,2023,Fall,49,
2,41637,CA,1,2023,Fall,57,
3,40365,CA,1,2023,Fall,63,
4,41516,CA,1,2023,Fall,51,
...,...,...,...,...,...,...,...
128,42380,CA,9,2023,Winter,37,1
129,42026,CA,1,2023,Winter,34,1
130,41131,CA,1,2023,Winter,47,1
131,41915,CA,1,2023,Winter,63,1
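
The name-based matching shown above can be demonstrated on a much smaller scale. In the sketch below (which uses two made-up one-row DataFrames), the second DataFrame lists its columns in a different order and includes an extra column; pd.concat() still lines the values up by column name and fills in a missing value where a column is absent.

```python
import pandas as pd

df_demo_a = pd.DataFrame({'Student_ID': [1], 'Score': [50]})
# This DataFrame shares two columns with df_demo_a, but they appear in a
# different order, and there's also an extra 'Test_Day' column:
df_demo_b = pd.DataFrame(
    {'Score': [60], 'Test_Day': ['1'], 'Student_ID': [2]})

df_demo_combined = pd.concat([df_demo_a, df_demo_b])
print(df_demo_combined)
```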


However, users who plan to process this data in a spreadsheet editor might still prefer to see the same column order that `test_results` features. Therefore, we can reorder our columns to match that order by storing the database table's columns within a list; adding our new 'Test_Day' column to the end; and then reordering the columns using this list.

In [12]:
# Creating a list of the columns in df_test_results along with our new 'Test_Day' column:
column_order = list(df_test_results.columns) + ['Test_Day']
column_order

['Student_ID',
 'School',
 'Grade',
 'Starting_Year',
 'Period',
 'Score',
 'Test_Day']

In [13]:
# Replacing df_result with a new copy that uses the column orders stored
# in column_order:
df_result = df_result[column_order].copy()
df_result

Unnamed: 0,Student_ID,School,Grade,Starting_Year,Period,Score,Test_Day
0,40865,CA,4,2023,Winter,30,1
1,41976,CA,2,2023,Winter,44,1
2,42194,CA,7,2023,Winter,61,1
3,41150,CA,3,2023,Winter,45,1
4,43342,CA,5,2023,Winter,48,1
...,...,...,...,...,...,...,...
128,42380,CA,9,2023,Winter,37,1
129,42026,CA,1,2023,Winter,34,1
130,41131,CA,1,2023,Winter,47,1
131,41915,CA,1,2023,Winter,63,1


## Converting our data cleaning operations into a function

Now that we've confirmed that our data cleaning code works for a single dataset, we can convert it into a function, then call that function for all 24 datasets. We'll add each DataFrame created by this function to a list that can in turn be converted into a single Winter results DataFrame.

In [14]:
def reformat_results(result_file):
    '''This function converts results into a format compatible with our main test_results table.
    result_file: the name of the file that contains the results to be reformatted.
    Documentation for this function can be found earlier within this notebook.
    '''

    school = result_file.split(' ')[0]
    test_day = result_file.split(' ')[3]
    print(f"Now processing results for Test Day {test_day} at {school}.")
    
    df_result = pd.read_csv(result_folder_name + result_file)
    df_result.rename(columns = {'Identification Code':'Student_ID', 
                            "Student's Grade":'Grade'}, inplace = True)
        
    df_result['Score'] = df_result['Score'].str.replace(
        '%', '').astype('int')
    df_result['Grade'] = df_result['Grade'].str[0]
    df_result['Student_ID'] = df_result['Student_ID'].str.replace(
        r'\D','', regex = True).astype('int')

    df_result['School'] = school
    df_result['Period'] = period
    df_result['Starting_Year'] = starting_year

    df_result['Test_Day'] = test_day

    df_result = df_result[column_order].copy()
    return df_result



## Creating a single DataFrame that stores reformatted results for all schools

The following cell calls reformat_results for all files within result_file_list, then adds those results together into a single DataFrame.

In [15]:
reformatted_df_list = []
for result_file in result_file_list:
    reformatted_df_list.append(reformat_results(result_file))

# The following line uses pd.concat() to combine all of the DataFrames
# in this list into one DataFrame.
df_reformatted_results = pd.concat(reformatted_df_list)
df_reformatted_results

Now processing results for Test Day 1 at CA.
Now processing results for Test Day 2 at CA.
Now processing results for Test Day 3 at CA.
Now processing results for Test Day 4 at CA.
Now processing results for Test Day 5 at CA.
Now processing results for Test Day 6 at CA.
Now processing results for Test Day 1 at DA.
Now processing results for Test Day 2 at DA.
Now processing results for Test Day 3 at DA.
Now processing results for Test Day 4 at DA.
Now processing results for Test Day 5 at DA.
Now processing results for Test Day 6 at DA.
Now processing results for Test Day 1 at HA.
Now processing results for Test Day 2 at HA.
Now processing results for Test Day 3 at HA.
Now processing results for Test Day 4 at HA.
Now processing results for Test Day 5 at HA.
Now processing results for Test Day 6 at HA.
Now processing results for Test Day 1 at SA.
Now processing results for Test Day 2 at SA.
Now processing results for Test Day 3 at SA.
Now processing results for Test Day 4 at SA.
Now proces

Unnamed: 0,Student_ID,School,Grade,Starting_Year,Period,Score,Test_Day
0,40865,CA,4,2023,Winter,30,1
1,41976,CA,2,2023,Winter,44,1
2,42194,CA,7,2023,Winter,61,1
3,41150,CA,3,2023,Winter,45,1
4,43342,CA,5,2023,Winter,48,1
...,...,...,...,...,...,...,...
63,43134,SA,3,2023,Winter,24,6
64,42465,SA,2,2023,Winter,42,6
65,42209,SA,4,2023,Winter,72,6
66,41380,SA,4,2023,Winter,34,6
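
Note that the index values in this combined table restart for each source file, which is why the final rows above are numbered 63 through 66. That's not a problem for the steps that follow, but if a continuous index were preferred, `ignore_index = True` could be passed to pd.concat(). Here's a sketch with two made-up DataFrames:

```python
import pandas as pd

demo_df_list = [pd.DataFrame({'Score': [30, 44]}),
                pd.DataFrame({'Score': [61, 45]})]

# Default behavior: each DataFrame's original index is kept.
demo_default_index = pd.concat(demo_df_list).index.tolist()
print(demo_default_index)

# With ignore_index = True, a new index running from 0 to n-1 is created.
df_demo_reindexed = pd.concat(demo_df_list, ignore_index=True)
print(df_demo_reindexed.index.tolist())
```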


## Removing duplicates

This new DataFrame is almost ready to be integrated into our original set of results. However, some students have multiple winter test results* due to test retakes, which could cause issues when merging these results into other tables and when calculating aggregate results. (A student with 2 results will get weighted twice as heavily as one with just one result.)

In order to address these multiple test entries, we'll sort the table by test day, then retain only the last (i.e. most recent) result for each test taker. 

\* I know that this is the case because I put this simulated data together. In real-world situations, though, it's always a good idea to check for duplicate results. An example of how to do so can be found below.

Identifying duplicate results:

In [16]:
df_reformatted_results[
df_reformatted_results.duplicated(
    subset = 'Student_ID', keep = False)].sort_values(['Student_ID', 'Test_Day'])
# Within duplicated(), keep = False causes all duplicate rows to be shown.

Unnamed: 0,Student_ID,School,Grade,Starting_Year,Period,Score,Test_Day
39,40009,CA,9,2023,Winter,28,1
26,40009,CA,9,2023,Winter,29,6
127,40057,HA,6,2023,Winter,52,2
40,40057,HA,6,2023,Winter,55,6
118,40060,HA,6,2023,Winter,73,5
...,...,...,...,...,...,...,...
21,43893,SA,1,2023,Winter,57,6
42,43956,HA,7,2023,Winter,55,2
33,43956,HA,7,2023,Winter,53,6
70,43992,SA,1,2023,Winter,45,2


## Removing duplicate results:

Within drop_duplicates(), 'subset' specifies the column (or columns) to check for duplicate values and `keep = 'last'` instructs Pandas to preserve the last row (i.e. the one closest to the bottom of the table) within each set of duplicates. Because we sort our DataFrame by Test_Day first, that last row will always be the most recent result rather than, say, the lowest score. 

In [17]:
df_reformatted_results = df_reformatted_results.sort_values(
    'Test_Day').drop_duplicates(
    subset = 'Student_ID', keep = 'last').copy().reset_index(drop=True)

# Now that we've removed duplicate records, we can drop the 'Test_Day'
# column from the dataset, as it doesn't show up in our fall/spring results:

df_reformatted_results.drop('Test_Day', axis = 1, inplace = True)

# Here's the duplicate-free version of the DataFrame: (note that the row count
# is now lower.)
df_reformatted_results

Unnamed: 0,Student_ID,School,Grade,Starting_Year,Period,Score
0,40865,CA,4,2023,Winter,30
1,40177,DA,4,2023,Winter,60
2,40864,DA,1,2023,Winter,58
3,42109,HA,5,2023,Winter,40
4,41395,HA,2,2023,Winter,32
...,...,...,...,...,...,...
2921,42941,DA,1,2023,Winter,57
2922,42549,HA,K,2023,Winter,47
2923,41307,CA,5,2023,Winter,41
2924,41380,SA,4,2023,Winter,34
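
Here's a smaller-scale sketch of the same sort-then-deduplicate pattern, using three made-up rows, that makes it easier to confirm which row survives. One assumption worth flagging: the Test_Day values in this notebook come from filenames and are therefore strings. That sorts correctly for single-digit test days, but a day labeled '10' would sort before '2', so converting the column to integers first would be safer if there were ever more than nine test days.

```python
import pandas as pd

# Student 40009 has two results: one from day 1 and a retake on day 6.
df_demo = pd.DataFrame({
    'Student_ID': [40009, 40057, 40009],
    'Score': [29, 52, 28],
    'Test_Day': ['6', '2', '1']})

# Sorting by test day places each student's most recent result last,
# so keep = 'last' retains that result:
df_demo_deduped = df_demo.sort_values('Test_Day').drop_duplicates(
    subset='Student_ID', keep='last').reset_index(drop=True)

print(df_demo_deduped)
```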


# Part 2: Combining Test Results and Calculating Weighted Averages

We'll now combine our reformatted and duplicate-free winter results with our fall and spring results. Once all results are present within the same dataset, we'll be able to calculate weighted averages of these results. However, we'll also need to ensure that missing winter results for certain students don't produce inaccurate averages.

First, we'll call pd.concat() to create a combined fall, spring, and winter dataset:

In [18]:
df_combined_results = pd.concat(
    [df_test_results, df_reformatted_results])
df_combined_results

Unnamed: 0,Student_ID,School,Grade,Starting_Year,Period,Score
0,42026,CA,1,2023,Fall,47
1,43491,CA,1,2023,Fall,49
2,41637,CA,1,2023,Fall,57
3,40365,CA,1,2023,Fall,63
4,41516,CA,1,2023,Fall,51
...,...,...,...,...,...,...
2921,42941,DA,1,2023,Winter,57
2922,42549,HA,K,2023,Winter,47
2923,41307,CA,5,2023,Winter,41
2924,41380,SA,4,2023,Winter,34


Next, we'll create a 'wide' version of this DataFrame that stores fall, winter, and spring results on the same row for each student.

In [19]:
df_weighted_average = df_combined_results.copy(
).pivot(index = 'Student_ID', columns = 'Period', 
        values = 'Score').reset_index()

# The following code moves the 'Spring' column to the end of 
# the dataset. This step isn't necessary for our calculations, 
# but our end users might prefer
# to see the test data in chronological order. 
# Using len(df_weighted_average.columns) -1 as the position of the inserted
# column rather than a hard-coded number makes this code a bit more
# flexible.

df_weighted_average.insert(len(df_weighted_average.columns) -1, 
    'Spring', df_weighted_average.pop('Spring'))

df_weighted_average

Period,Student_ID,Fall,Winter,Spring
0,40001,62.0,,58.0
1,40002,35.0,21.0,75.0
2,40003,33.0,30.0,40.0
3,40004,42.0,30.0,65.0
4,40005,49.0,46.0,67.0
...,...,...,...,...
3995,43996,38.0,,70.0
3996,43997,62.0,55.0,41.0
3997,43998,35.0,,36.0
3998,43999,43.0,46.0,56.0
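
If pivot() is new to you, the following sketch (which uses a made-up five-row 'long' table) shows the reshaping on a scale that's easier to follow. Note that pivot() raises an error when the same index/column pair appears more than once, which is another reason why removing duplicate winter results beforehand was important.

```python
import pandas as pd

# A 'long' table with one row per student/period pair. Student 40001
# has no winter result.
df_demo_long = pd.DataFrame({
    'Student_ID': [40001, 40001, 40002, 40002, 40002],
    'Period': ['Fall', 'Spring', 'Fall', 'Winter', 'Spring'],
    'Score': [62, 58, 35, 21, 75]})

# Each unique 'Period' value becomes its own column; missing
# student/period pairs are filled with NaN.
df_demo_wide = df_demo_long.pivot(
    index='Student_ID', columns='Period', values='Score').reset_index()

print(df_demo_wide)
```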


Every student has a valid Fall and Spring test result record; however, quite a few students have missing winter records, as the following line shows:

In [20]:
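# Storing the count in its own variable keeps the f-string simple and
# compatible with Python versions earlier than 3.12 (which don't allow
# double quotes to be nested within a double-quoted f-string).
missing_winter_count = df_weighted_average['Winter'].isna().sum()
print(f"There are {missing_winter_count} missing winter results in our dataset.")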

There are 1074 missing winter results in our dataset.


We'll need to make sure that these missing results don't (1) prevent weighted averages from being calculated for some students or (2), worse yet, produce inaccurately low overall scores.

## Calculating our weighted averages:

In order to give more recent test results more prominence in each student's overall score, we'll have fall results account for 20% of a student's overall score; winter results account for 30%; and spring results account for 50%. These values can be stored within our code as follows:

In [21]:
weight_dict = {'Fall':0.2, 'Winter':0.3, 'Spring':0.5}

First, for comparison purposes, we'll create an unweighted average (i.e. one that attaches equal importance to each score). Note that Pandas automatically skips over missing (NaN) Winter values when calculating means.

In [22]:
df_weighted_average['Unweighted_Avg'] = df_weighted_average[
['Fall', 'Spring', 'Winter']].mean(axis=1)
df_weighted_average

Period,Student_ID,Fall,Winter,Spring,Unweighted_Avg
0,40001,62.0,,58.0,60.000000
1,40002,35.0,21.0,75.0,43.666667
2,40003,33.0,30.0,40.0,34.333333
3,40004,42.0,30.0,65.0,45.666667
4,40005,49.0,46.0,67.0,54.000000
...,...,...,...,...,...
3995,43996,38.0,,70.0,54.000000
3996,43997,62.0,55.0,41.0,52.666667
3997,43998,35.0,,36.0,35.500000
3998,43999,43.0,46.0,56.0,48.333333


Next, we'll try 3 times to create a weighted average column. (The third time's the charm, right?)

The first approach, shown below, multiplies each student's fall, winter, and spring test results by the weight we've assigned those results. The problem with this approach is that, if one of the results is NaN (missing), the final weighted average will also be NaN. This approach is inadequate for our situation, so we'll call the column it produces 'Bad_Weighted_Avg.'

In [23]:
df_weighted_average['Bad_Weighted_Avg'] = (
    df_weighted_average['Fall'] * weight_dict['Fall']
    + df_weighted_average['Winter'] * weight_dict['Winter'] +
    df_weighted_average['Spring'] * weight_dict['Spring'])

df_weighted_average

Period,Student_ID,Fall,Winter,Spring,Unweighted_Avg,Bad_Weighted_Avg
0,40001,62.0,,58.0,60.000000,
1,40002,35.0,21.0,75.0,43.666667,50.8
2,40003,33.0,30.0,40.0,34.333333,35.6
3,40004,42.0,30.0,65.0,45.666667,49.9
4,40005,49.0,46.0,67.0,54.000000,57.1
...,...,...,...,...,...,...
3995,43996,38.0,,70.0,54.000000,
3996,43997,62.0,55.0,41.0,52.666667,49.4
3997,43998,35.0,,36.0,35.500000,
3998,43999,43.0,46.0,56.0,48.333333,50.4
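
The difference between the two columns above comes down to how Pandas treats NaN values: mean() skips them by default, whereas arithmetic with `+` passes them along to the result. Here's a one-row sketch of that behavior. (The scores match those of the first student in the table above; the variable names are just for illustration.)

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {'Fall': [62.0], 'Winter': [np.nan], 'Spring': [58.0]})

# mean() skips NaN values by default (skipna=True), so this row's
# average is based on the fall and spring scores only:
demo_mean = df_demo[['Fall', 'Winter', 'Spring']].mean(axis=1)

# Adding values together with + propagates NaN, so a single
# missing score makes the entire sum NaN:
demo_sum = (df_demo['Fall'] * 0.2 + df_demo['Winter'] * 0.3
            + df_demo['Spring'] * 0.5)

print(demo_mean.iloc[0], demo_sum.iloc[0])
```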


The next approach is even worse! It fills in missing values with 0s, which solves the problem of NaN weighted average values but creates an even worse error. Now, students with missing winter scores will receive inaccurately low weighted averages as a result. (For example, a student with a fall result of 62 and a spring result of 58 will get a score of 41.4 (62*0.2 + 0*0.3 + 58*0.5), which certainly isn't fair.) We'll name the column that stores these values 'Terrible_Weighted_Avg' due to this bug.

In [24]:
df_weighted_average['Terrible_Weighted_Avg'] = (
    df_weighted_average['Fall'].fillna(0) * weight_dict['Fall']
    + df_weighted_average['Winter'].fillna(0) * weight_dict['Winter'] 
    + df_weighted_average['Spring'].fillna(0) * weight_dict['Spring'])
df_weighted_average

Period,Student_ID,Fall,Winter,Spring,Unweighted_Avg,Bad_Weighted_Avg,Terrible_Weighted_Avg
0,40001,62.0,,58.0,60.000000,,41.4
1,40002,35.0,21.0,75.0,43.666667,50.8,50.8
2,40003,33.0,30.0,40.0,34.333333,35.6,35.6
3,40004,42.0,30.0,65.0,45.666667,49.9,49.9
4,40005,49.0,46.0,67.0,54.000000,57.1,57.1
...,...,...,...,...,...,...,...
3995,43996,38.0,,70.0,54.000000,,42.6
3996,43997,62.0,55.0,41.0,52.666667,49.4,49.4
3997,43998,35.0,,36.0,35.500000,,25.0
3998,43999,43.0,46.0,56.0,48.333333,50.4,50.4


The final approach creates a better set of weighted averages by first calculating individual weights for each student. These weights will be equal to the original weights (0.2 for fall, 0.3 for winter, and 0.5 for spring) if data are present for those periods *and 0 otherwise*. The sum of these student-level weights will get stored in a 'Total_Weight' column.

Next, as in the previous example, the script will calculate the sum of the products of each test score with its corresponding weight; fillna(0) will be used to prevent missing data from producing an NaN sum. However, the script then divides this sum by the 'Total_Weight' column in order to adjust for missing results.

In [25]:
# Calculating student-level weight values:
for column in ['Fall', 'Winter', 'Spring']:
    df_weighted_average[column+'_Weight_Val'] = np.where(
        df_weighted_average[column].notna(), weight_dict[column], 0)

# Adding these weights together:
df_weighted_average['Total_Weight'] = df_weighted_average[
[column for column in df_weighted_average if 'Weight_Val' in column]].sum(axis = 1)

# Calculating an adjusted weighted average that takes missing values into account:
df_weighted_average['Weighted_Avg'] = ((
    df_weighted_average['Fall'].fillna(0) * weight_dict['Fall']
    + df_weighted_average['Winter'].fillna(0) * weight_dict['Winter'] 
    + df_weighted_average['Spring'].fillna(0) * weight_dict['Spring']) 
    / df_weighted_average['Total_Weight'])
    # Dividing the sum of the score/weight products by Total_Weight allows us
    # to adjust for missing data.

df_weighted_average

Period,Student_ID,Fall,Winter,Spring,Unweighted_Avg,Bad_Weighted_Avg,Terrible_Weighted_Avg,Fall_Weight_Val,Winter_Weight_Val,Spring_Weight_Val,Total_Weight,Weighted_Avg
0,40001,62.0,,58.0,60.000000,,41.4,0.2,0.0,0.5,0.7,59.142857
1,40002,35.0,21.0,75.0,43.666667,50.8,50.8,0.2,0.3,0.5,1.0,50.800000
2,40003,33.0,30.0,40.0,34.333333,35.6,35.6,0.2,0.3,0.5,1.0,35.600000
3,40004,42.0,30.0,65.0,45.666667,49.9,49.9,0.2,0.3,0.5,1.0,49.900000
4,40005,49.0,46.0,67.0,54.000000,57.1,57.1,0.2,0.3,0.5,1.0,57.100000
...,...,...,...,...,...,...,...,...,...,...,...,...
3995,43996,38.0,,70.0,54.000000,,42.6,0.2,0.0,0.5,0.7,60.857143
3996,43997,62.0,55.0,41.0,52.666667,49.4,49.4,0.2,0.3,0.5,1.0,49.400000
3997,43998,35.0,,36.0,35.500000,,25.0,0.2,0.0,0.5,0.7,35.714286
3998,43999,43.0,46.0,56.0,48.333333,50.4,50.4,0.2,0.3,0.5,1.0,50.400000


Now that we've created our weighted averages, we'll drop the columns storing our first two attempts and save the output to a .csv file.

In [26]:
df_weighted_average.drop(['Bad_Weighted_Avg', 'Terrible_Weighted_Avg'], axis = 1, inplace = True)
df_weighted_average.to_csv('weighted_averages.csv', index = False)
df_weighted_average

Period,Student_ID,Fall,Winter,Spring,Unweighted_Avg,Fall_Weight_Val,Winter_Weight_Val,Spring_Weight_Val,Total_Weight,Weighted_Avg
0,40001,62.0,,58.0,60.000000,0.2,0.0,0.5,0.7,59.142857
1,40002,35.0,21.0,75.0,43.666667,0.2,0.3,0.5,1.0,50.800000
2,40003,33.0,30.0,40.0,34.333333,0.2,0.3,0.5,1.0,35.600000
3,40004,42.0,30.0,65.0,45.666667,0.2,0.3,0.5,1.0,49.900000
4,40005,49.0,46.0,67.0,54.000000,0.2,0.3,0.5,1.0,57.100000
...,...,...,...,...,...,...,...,...,...,...
3995,43996,38.0,,70.0,54.000000,0.2,0.0,0.5,0.7,60.857143
3996,43997,62.0,55.0,41.0,52.666667,0.2,0.3,0.5,1.0,49.400000
3997,43998,35.0,,36.0,35.500000,0.2,0.0,0.5,0.7,35.714286
3998,43999,43.0,46.0,56.0,48.333333,0.2,0.3,0.5,1.0,50.400000
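
For readers who would like a more compact take on the adjusted weighted average, here's a sketch of a vectorized version that avoids writing out each period by name. It isn't the method used above, just one possible alternative; the scores are made up, and the names `demo_weights` and `df_demo` are for illustration only. It relies on the fact that multiplying a DataFrame by a Series matches the Series' index labels to the DataFrame's column names.

```python
import numpy as np
import pandas as pd

demo_weights = pd.Series({'Fall': 0.2, 'Winter': 0.3, 'Spring': 0.5})

# Two hypothetical students; the first is missing a winter score.
df_demo = pd.DataFrame({
    'Fall': [80.0, 70.0],
    'Winter': [np.nan, 60.0],
    'Spring': [90.0, 50.0]})

periods = list(demo_weights.index)

# Sum of score * weight, with missing scores contributing 0:
weighted_sum = (df_demo[periods].fillna(0) * demo_weights).sum(axis=1)

# Sum of the weights for only those periods that have a score:
total_weight = (
    df_demo[periods].notna().astype(float) * demo_weights).sum(axis=1)

# Dividing by the total weight adjusts for missing scores:
df_demo['Weighted_Avg'] = weighted_sum / total_weight
print(df_demo)
```

Because each period's weight is looked up by column name, this approach also makes it harder to accidentally pair a score with another period's weight.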


That's it for this lesson! Data cleaning isn't always fun*, but you'll find it to be a crucial prerequisite for many of your own Python projects.

\*However, cleaning data in Python is still more fun than cleaning it in a spreadsheet editor, especially when you're tasked with cleaning the same type of data over and over again.