# Data Prep

By Kenneth Burchfiel

Released under the MIT License

Before a dataset can be analyzed and visualized within Python, it often needs to be reformatted and cleaned. This script will clean and reformat our NVCU student survey results, then merge in data from a separate table in order to demonstrate how Python can easily perform these reformatting and cleaning tasks. 

The survey_results table within our NVCU database already contains student survey responses for the fall and spring. However, let's say that you've been asked to add a set of winter results to this dataset as well, then calculate a weighted average of fall, winter, and spring survey results for each student.

If these results were in the same format as the fall and spring ones and had no missing data, this process would be very simple. Unfortunately, that's not the case with the winter results that we'll be processing within this script. These results feature:

1. Column names that differ from those in the fall/spring results
2. Different data formats
3. A missing column
4. Duplicate values
5. Missing values for certain students
6. Results spread over 16 separate files (one for each level within each college)

It would be cumbersome and mind-numbing to modify each of these 16 datasets within Excel, Google Sheets, or a similar program so that they could be combined with our pre-existing fall and spring data. However, the Python code shown below will make this data cleaning process much easier. And once this script is in place, if you happened to get next year's winter results in the same format* as this year's, you'd be able to get them cleaned up and reformatted in no time.

*\*You may find in your work, however, that the results are in yet another format the following year, followed by a different format the year after that. Data-related tasks are always made easier when inputs stay the same, but in the real world, you'll often need to rework datasets in order to make them compatible with pre-existing processes.*

In [None]:
import os
import pandas as pd
from sqlalchemy import create_engine

# Cleaning and reformatting winter survey data

Our first step in preparing our winter survey results will be to import the 16 files that comprise them into a DataFrame. We'll first use os.listdir() to create a list of all files within our winter_results folder:

In [None]:
file_list = os.listdir('winter_results')
file_list

Next, we'll use a for loop to read each file within this list into a DataFrame. We'll then apply pd.concat() to combine these results into a single DataFrame.

In [None]:
df_list = []
for file in file_list:
    df = pd.read_csv(f'winter_results/{file}')
    df_list.append(df)
df_winter_results = pd.concat(
    [df for df in df_list]) # df for df in df_list is a list comprehension
# that contains all DataFrames in df_list.
df_winter_results.reset_index(drop=True,inplace=True)
df_winter_results

The following cell shows a more concise means of creating the same DataFrame. Although this approach requires fewer lines of code, it's also less flexible (as the former method allows you to make individual updates to each DataFrame if needed).

In [None]:
df_winter_results = pd.concat(
    [pd.read_csv(f'winter_results/{file}') 
     for file in os.listdir('winter_results')]).reset_index(drop=True)
df_winter_results
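One caveat with os.listdir() is that it returns every entry in a folder, in no guaranteed order, so a stray non-CSV file would cause pd.read_csv() to fail or produce junk. The following is a minimal sketch of a more defensive import that uses pathlib's glob() method instead. The temporary folder, file names, and values are all made up for illustration and are not part of the NVCU dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway folder with two small CSV files plus one unrelated
# file. (These file names and values are invented for this example.)
temp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({'SEASON': ['W'], 'SURVEY_SCORE': ['80.0%']}).to_csv(
    temp_dir / 'college_a_fr.csv', index=False)
pd.DataFrame({'SEASON': ['W'], 'SURVEY_SCORE': ['65.0%']}).to_csv(
    temp_dir / 'college_a_so.csv', index=False)
(temp_dir / 'notes.txt').write_text('Not a survey file')

# glob('*.csv') skips notes.txt, and sorted() makes the read order
# (and thus the row order) reproducible:
csv_paths = sorted(temp_dir.glob('*.csv'))
df_demo = pd.concat(
    [pd.read_csv(path) for path in csv_paths], ignore_index=True)
df_demo
```

Passing `ignore_index=True` to pd.concat() has the same effect as calling reset_index(drop=True) afterward.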

## Reformatting and cleaning our dataset

Our next step is to combine these winter survey results with the fall and spring results in our NVCU database. Here's what those results look like:

In [None]:
# Connecting to our database:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')
df_fall_spring_results = pd.read_sql(
    "Select * from survey_results", con = e)
df_fall_spring_results

If we naively tried to add our winter results to our fall/spring results, we'd end up with a very messy DataFrame with numerous blank cells:

In [None]:
pd.concat([df_fall_spring_results, df_winter_results])

This messy output is caused by discrepancies in column names between the two tables. We'll need to rename our winter results fields to match their corresponding fields within the fall/spring table. Thankfully, Pandas makes this process very straightforward:

In [None]:
df_winter_results.rename(columns = {
    'SEASON':'season','STARTINGYR':'starting_year',
    'SURVEY_SCORE':'score'}, inplace = True)
df_winter_results
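When two tables have many columns, it can be hard to spot naming discrepancies by eye. The sketch below shows one possible way to list them using set differences. The two small DataFrames are toy stand-ins whose values are invented; their column names are assumed to mirror the ones discussed in this section.

```python
import pandas as pd

# Toy stand-ins for the fall/spring and winter tables (values invented):
df_a = pd.DataFrame({'student_id': ['2020-1'], 'season': ['Fall'],
                     'starting_year': [2023], 'score': [88]})
df_b = pd.DataFrame({'MATRIC#': [1], 'MATRICYR': [20], 'SEASON': ['W'],
                     'STARTINGYR': [23], 'SURVEY_SCORE': ['71.0%']})

# Columns that appear in only one of the two tables:
only_in_a = sorted(set(df_a.columns) - set(df_b.columns))
only_in_b = sorted(set(df_b.columns) - set(df_a.columns))

# After renaming, we can check which columns are still unaccounted for:
df_b_renamed = df_b.rename(columns={
    'SEASON': 'season', 'STARTINGYR': 'starting_year',
    'SURVEY_SCORE': 'score'})
still_missing = sorted(set(df_a.columns) - set(df_b_renamed.columns))
still_missing
```

In this toy case, only `student_id` remains missing after the rename.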

We'll also need to convert our `MATRIC#` and `MATRICYR` fields into a single `student_id` field. (This student_id field simply combines students' matriculation years with their matriculation numbers; see nvcu_db_gen.ipynb within the Appendix for more details.) This can be done as follows:

In [None]:
df_winter_results['MATRICYR'] += 2000 # Converts our MATRICYR
# values from YY to YYYY format so that they'll match the format of the 
# matriculation year component of the student_id values within 
# df_fall_spring_results

# Converting students' MATRICYR and MATRIC# values into student IDs:
# (Note that both columns must be converted to strings in order for
# this code to work.)
df_winter_results['student_id'] = (
    df_winter_results['MATRICYR'].astype('str') 
    + '-' 
    + df_winter_results['MATRIC#'].astype('str'))
# Now that we've used our MATRICYR and MATRIC# columns to create 
# our student IDs, we no longer need to retain those columns:
df_winter_results.drop(
    ['MATRICYR', 'MATRIC#'], 
    axis = 1, inplace = True)
df_winter_results
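To make the ID-building step easier to follow, here is the same logic applied to a two-row toy DataFrame. The matriculation years and numbers are invented; only the column names and the `YYYY-number` pattern come from the code above.

```python
import pandas as pd

# Two made-up students with two-digit matriculation years:
df_ids = pd.DataFrame({'MATRICYR': [20, 21], 'MATRIC#': [15, 302]})

df_ids['MATRICYR'] += 2000  # 20 -> 2020, 21 -> 2021

# Both columns are integers, so each must be cast to a string before
# the pieces can be joined with '-':
df_ids['student_id'] = (
    df_ids['MATRICYR'].astype('str')
    + '-'
    + df_ids['MATRIC#'].astype('str'))
df_ids
```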

The columns in df_winter_results now match those within df_fall_spring_results. That's great! Let's try combining the two datasets to see if we're ready to perform analyses on them:

In [None]:
df_results = pd.concat(
    [df_fall_spring_results, df_winter_results])
df_results.head()

This output shows that, unfortunately, we're not quite ready to analyze this data just yet: there are several formatting differences that we'll need to address. 

For instance, the 'score' column within df_fall_spring_results uses an integer format, whereas these same numbers are formatted as strings within df_winter_results. This will produce errors when we attempt to perform numerical calculations on this field:

In [None]:
# df_results['score'].mean() 
# Raises a TypeError: "unsupported operand type(s) for +: 'int' and 'str'"

The following cell resolves this issue by converting our string-formatted score values to integers:

In [None]:
# Converting our score values to integers:
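# (regex=False ensures that '.0%' is treated as literal text rather
# than as a regular expression, regardless of your Pandas version.)
df_winter_results['score'] = df_winter_results[
    'score'].str.replace('.0%', '', regex=False).astype('int')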
df_winter_results.head()
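The astype('int') call will raise an error if any score can't be parsed as a number. If you'd rather flag such entries than halt the script, one option is pd.to_numeric() with `errors='coerce'`, as in this minimal sketch. The three score strings, including the 'N/A' entry, are invented for illustration.

```python
import pandas as pd

# Invented scores; 'N/A' stands in for an unparseable entry:
scores = pd.Series(['85.0%', '100.0%', 'N/A'])

# Strip the literal '.0%' suffix, then convert. Values that still
# aren't numeric become NaN instead of raising an error:
cleaned = scores.str.replace('.0%', '', regex=False)
numeric = pd.to_numeric(cleaned, errors='coerce')

# Counting the NaN values shows how many entries need a closer look:
n_unparsed = int(numeric.isna().sum())
n_unparsed
```

Note that the presence of NaN forces the result into a float format, so this approach trades the integer type for the ability to inspect problem rows.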

We'll also need to reformat our winter results' `starting_year` and `season` values so that they match the formats found in the fall/spring table. 

The following cell replaces the 'W' values within the 'season' column with 'Winter' so that they'll match how seasons are formatted within df_fall_spring_results:

In [None]:
df_winter_results['season'] = (
    df_winter_results['season'].replace({'W':'Winter'}))
df_winter_results

The following code would also have worked; however, it assumes that every row within the DataFrame is indeed a winter result. This is the case in our simulated data, but in the real world, some data from other seasons might have leaked in, causing this code to incorrectly reclassify certain results.

In [None]:
# df_winter_results['season'] = 'Winter'
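One way to guard against leaked-in data from other seasons is to check for values that your replacement dictionary doesn't cover. The following sketch assumes a hypothetical stray 'S' code; no such value is known to exist in the actual winter files.

```python
import pandas as pd

# Invented season codes; the 'S' entry simulates a leaked-in row:
seasons = pd.Series(['W', 'W', 'S', 'W'])
season_map = {'W': 'Winter'}

# Any codes missing from season_map will show up here:
unexpected = sorted(set(seasons.unique()) - set(season_map))

# replace() only touches values found in season_map, so the stray
# 'S' is left as-is rather than being mislabeled as 'Winter':
recoded = seasons.replace(season_map)
recoded
```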

Finally, we'll add 2000 to every starting_year value so that our years will show up within YYYY format--just as they do within our fall and spring results.

In [None]:
df_winter_results['starting_year'] += 2000
df_winter_results.head()
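A common notebook pitfall with in-place updates like `+= 2000` is that rerunning the cell adds 2000 a second time. The sketch below shows one possible safeguard: converting only values that still look like two-digit years. The cutoff of 100 and the sample values are assumptions made for this example.

```python
import pandas as pd

# Invented years: two in YY format and one already in YYYY format
years = pd.Series([23, 23, 2023])

# Series.where() keeps values where the condition is True and swaps
# in the alternative (years + 2000) everywhere else:
two_digit = years < 100
years_fixed = years.where(~two_digit, years + 2000)
years_fixed
```

Because already-converted values are left alone, this version can be rerun without changing its output.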

## Removing duplicates

We've now successfully made our winter dataset's field names and values compatible with those in our fall/spring dataset. However, before we can combine the two together, we'll need to remove some duplicate results.

The following code filters df_winter_results to show only those rows whose `season`, `starting_year`, and `student_id` values all match those of at least one other row. (The inclusion of `keep = False` instructs Pandas to flag every copy of a duplicated row; by default, the first occurrence would not be flagged.)

In [None]:
df_winter_results[df_winter_results.duplicated(
    subset = ['season', 'starting_year', 'student_id'], 
    keep = False)].head()

These duplicate values can easily be removed using Pandas' drop_duplicates() method. However, before removing duplicate rows, it's a good idea to consider which one to retain and then sort the DataFrame accordingly.

In our case, we'll keep the duplicated row with the highest survey result and remove all others. We can do this by (1) sorting our DataFrame to show higher scores before lower ones and then (2) keeping the first row (i.e. the one with the highest score) when removing our duplicates.

In [None]:
df_winter_results.sort_values(
    'score', ascending = False, inplace = True)
df_winter_results.head()

Removing duplicate values:

Note: when removing duplicates, think carefully about which columns to include in your `subset` argument. For instance, if you had multiple years' worth of data in your table, using `['season', 'student_id']` as your subset would cause only *one* result for each student/season pair to get retained, thus removing valid data for other years from your table.

In [None]:
df_winter_results.drop_duplicates(
    subset = ['season', 'starting_year', 'student_id'], 
    keep = 'first', inplace = True)
df_winter_results.head()
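The note above about choosing `subset` columns can be illustrated with a small example. The following sketch uses an invented three-row table in which one student has a valid 2022 result plus two 2023 results (one of which is a duplicate).

```python
import pandas as pd

# Invented data: one student, two years, one true duplicate (2023)
df_multi = pd.DataFrame({
    'student_id': ['2020-1', '2020-1', '2020-1'],
    'season': ['Winter', 'Winter', 'Winter'],
    'starting_year': [2022, 2023, 2023],
    'score': [70, 90, 60]})

# Leaving starting_year out of the subset collapses both years into
# a single row, discarding a valid result:
too_narrow = df_multi.drop_duplicates(subset=['season', 'student_id'])

# Including starting_year removes only the true duplicate:
correct = df_multi.drop_duplicates(
    subset=['season', 'starting_year', 'student_id'])
len(too_narrow), len(correct)
```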

Rerunning our duplicate check code confirms that no duplicate entries remain within our dataset:

In [None]:
df_winter_results[df_winter_results.duplicated(
    subset = ['season', 'starting_year', 'student_id'], 
    keep = False)].head()
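As a compact check of the sort-then-drop approach, the sketch below applies it to a toy table and confirms that the higher of two duplicate scores survives. The student IDs and scores are invented for illustration.

```python
import pandas as pd

# Invented data: student 2020-1 has two winter results (60 and 90)
df_dupes = pd.DataFrame({
    'student_id': ['2020-1', '2020-1', '2020-2'],
    'season': ['Winter'] * 3,
    'starting_year': [2023] * 3,
    'score': [60, 90, 75]})
key_cols = ['season', 'starting_year', 'student_id']

# Sort so that the highest score comes first within each duplicate
# group, keep that first row, then restore a tidy row order:
df_best = (df_dupes.sort_values('score', ascending=False)
           .drop_duplicates(subset=key_cols, keep='first')
           .sort_values('student_id')
           .reset_index(drop=True))

# This check raises an AssertionError if any duplicates remain:
assert not df_best.duplicated(subset=key_cols).any()
df_best
```

Adding an assert statement like this one to a cleaning script can catch duplicates automatically the next time the script is run on new data.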

We're now finally ready to combine df_winter_results with df_fall_spring_results. One final issue remains with the winter table, however: survey results are missing for a number of students. This won't cause any issues with the following code, but we'll need to take these missing entries into account when analyzing our survey data within descriptive_stats.ipynb.

## Combining winter survey results with our fall/spring dataset

In [None]:
df_results = pd.concat(
    [df_fall_spring_results, 
     df_winter_results]).sort_values(
    ['starting_year', 'season']).reset_index(drop=True)
df_results
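If you'd like to see which students lack a winter result, one possible approach is to pivot the combined table so that each season becomes its own column and then look for blanks. The sketch below uses a small invented table whose column names are assumed to match those of df_results.

```python
import pandas as pd

# Invented data: student 2020-2 has no winter result
df_long = pd.DataFrame({
    'student_id': ['2020-1', '2020-1', '2020-1', '2020-2', '2020-2'],
    'season': ['Fall', 'Winter', 'Spring', 'Fall', 'Spring'],
    'starting_year': [2023] * 5,
    'score': [80, 85, 90, 70, 75]})

# One row per student/year pair and one column per season. Any
# season a student lacks shows up as NaN:
df_wide = df_long.pivot(
    index=['student_id', 'starting_year'],
    columns='season', values='score')

# Listing the students whose Winter cell is blank:
missing_winter = df_wide[
    df_wide['Winter'].isna()].index.get_level_values(
    'student_id').tolist()
missing_winter
```

Note that pivot() will raise an error if any student/year/season combination appears more than once, so it should be run only after duplicates have been removed.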

We'll now save this dataset to a .csv file so that it can be processed by descriptive_stats.ipynb:

In [None]:
df_results.to_csv('2023_survey_results.csv', index = False)

This script has provided an introduction to data cleaning and reformatting. Other PFN sections will provide further examples of data reformatting, as reshaping data is often a necessary prerequisite for analysis and visualization tasks.