# Cleaning the Enchantments Lottery Data - 2021

## Introduction

The Enchantments lottery allows applicants to make one to three trip choices. Therefore, rows for applicants who made fewer than three choices contain `NaN` values that need wrangling, as do the award columns of unsuccessful applications. In addition to handling `NaN` values, this part of the analysis coerced data types, checked for duplicates, and made further changes to (hopefully) make the data easier to work with.

## Changes

### Date columns

Multiple columns in the dataframe containing date values were read in as `object` type, so these date columns were converted to `datetime64[ns]`. Missing values in these columns were then replaced with the zero epoch date (`pd.Timestamp(0)`) so that every row holds a valid date and the columns keep their date data type.

Converting `NaN` dates to zero epoch causes the date `1970-01-01` to appear in date columns where it may not make sense. This is intentional: the analyst should treat these zero epoch dates as empty or null values.
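
Because the sentinel is a real timestamp, aggregations such as `min()` will include it unless it is masked first. Below is a minimal sketch of how an analyst might turn the zero epoch dates back into missing values before computing on a date column. The toy data and the name `EPOCH_ZERO` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the cleaned data: 1970-01-01 marks "no date".
EPOCH_ZERO = pd.Timestamp(0)
df = pd.DataFrame(
    {
        "awarded_entry_date": pd.to_datetime(
            ["1970-01-01", "2021-07-25", "1970-01-01", "2021-08-12"]
        )
    }
)

# Replace the sentinel with NaT so it is skipped by min(), max(), and counts.
awarded = df["awarded_entry_date"].mask(df["awarded_entry_date"] == EPOCH_ZERO)

n_missing = int(awarded.isna().sum())
earliest = awarded.min()
print(n_missing, earliest)
```

Masking on a copy like this leaves the stored column free of nulls while keeping the sentinel out of date calculations.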

### Number columns

Some of the number columns were designated `int64` while others were `float64`. The split didn't reflect the data: the `float64` columns were simply the ones containing `NaN`, which `int64` cannot represent, and all of their values are whole numbers. Therefore, `NaN` values in the number columns were filled with `0`, and all the number columns were then converted to `int64` without data loss.
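
The "without data loss" claim can be checked before casting. The following is a small sketch on a made-up column, not the lottery data itself, assuming the check is simply that no non-missing value has a fractional part:

```python
import pandas as pd

# Hypothetical float column with a gap, like the second and third trip choices.
sizes = pd.Series([7.0, 8.0, None, 4.0], name="group_size")

# Every non-missing value should have no fractional part before casting.
is_whole = bool((sizes.dropna() % 1 == 0).all())

# Fill the gap with 0, then cast; 0 now stands for "no value given".
sizes_int = sizes.fillna(0).astype("int64")
print(is_whole, sizes_int.tolist())
```

Note that after this step a `0` group size is a placeholder rather than a real request, much like the zero epoch dates.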

#### Group size redundancy

The data showed columns for minimum and maximum group size. However, the difference between these columns was zero in every row. If the columns never differ, the maximum column is redundant. The columns for maximum group size in each entry set were therefore removed.
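
The same redundancy check can be written as a loop over the entry sets. This is a compact sketch on made-up rows that reuses the raw column names and covers only two entry sets for brevity:

```python
import pandas as pd

# Hypothetical rows using the raw column names from the lottery file.
df = pd.DataFrame(
    {
        "Minimum Acceptable Group Size 1": [7, 8, 4],
        "Maximum Requested Group Size 1": [7, 8, 4],
        "Minimum Acceptable Group Size 2": [7, 0, 4],
        "Maximum Requested Group Size 2": [7, 0, 4],
    }
)

# A maximum column is redundant only if it equals its minimum column in every row.
redundant = [
    f"Maximum Requested Group Size {i}"
    for i in (1, 2)
    if df[f"Maximum Requested Group Size {i}"]
    .eq(df[f"Minimum Acceptable Group Size {i}"])
    .all()
]

df = df.drop(columns=redundant)
print(list(df.columns))
```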

### Text columns

The columns containing text had many `NaN` values. These values were filled with `N/A`, possibly short for "No answer". Also, the text column values, which are of type `object`, were cast to `str` to rule out mixed data types (pandas still reports the column dtype as `object`).

### Column names

The column names were changed to lowercase, and spaces and slashes were replaced with underscores.

## Importing the cleaned dataset

Re-importing the data was more troublesome than I had hoped. Consequently, a few arguments are needed to import the data frame into `pandas` without redoing the cleaning steps (note that `date_format` requires pandas 2.0 or later). The code is below:

```
import pandas as pd

cleaned_df = pd.read_csv(
    "./2021_results_cleaned.csv",
    # Import was failing to parse date columns, so I
    # had to pass in the column names
    parse_dates=[
        "preferred_entry_date_1",
        "preferred_entry_date_2",
        "preferred_entry_date_3",
        "awarded_entry_date",
    ],
    date_format="%m-%d-%Y",  # Align format with export format
    na_filter=False,  # Do not convert 'N/A' to NaN
)
```
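
To see why `na_filter=False` matters, note that `read_csv` treats the string `N/A` as a missing value by default. Here is a minimal sketch with a made-up two-row CSV; the row values are placeholders, not lottery records:

```python
import io

import pandas as pd

# Hypothetical CSV text shaped like two columns of the cleaned export.
csv_text = (
    "results_status,awarded_entrance_code_name\n"
    "Unsuccessful,N/A\n"
    "Example status,Example zone\n"
)

# Default parsing converts "N/A" back into NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# With na_filter=False the literal string "N/A" is preserved.
kept_read = pd.read_csv(io.StringIO(csv_text), na_filter=False)

n_nan_default = int(default_read["awarded_entrance_code_name"].isna().sum())
kept_values = kept_read["awarded_entrance_code_name"].tolist()
print(n_nan_default, kept_values)
```

Since `na_filter=False` turns off missing-value detection for the whole file, it only makes sense here because the cleaned export has no empty cells.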


In [30]:
import pandas as pd

# Import 2021 Enchantments Lottery Data
raw_df = pd.read_csv("./2021_results.csv", header=1, parse_dates=True)

# Take a quick look at the data
raw_df.head()

Unnamed: 0,Preferred Entry Date 1,Preferred Division 1,Minimum Acceptable Group Size 1,Maximum Requested Group Size 1,Preferred Entry Date 2,Preferred Division 2,Minimum Acceptable Group Size 2,Maximum Requested Group Size 2,Preferred Entry Date 3,Preferred Division 3,Minimum Acceptable Group Size 3,Maximum Requested Group Size 3,Results Status,Awarded Preference,Awarded Entry Date,Awarded Entrance Code/Name,Awarded Group Size
0,7/25/2021,Core Enchantment Zone,7,7,8/1/2021,Core Enchantment Zone,7.0,7.0,7/25/2021,Snow Zone,8.0,8.0,Unsuccessful,,,,
1,8/12/2021,Core Enchantment Zone,8,8,8/12/2021,Colchuck Zone,8.0,8.0,8/12/2021,Eightmile/Caroline Zone,8.0,8.0,Unsuccessful,,,,
2,7/30/2021,Core Enchantment Zone,4,4,8/14/2021,Core Enchantment Zone,4.0,4.0,7/16/2021,Core Enchantment Zone,7.0,7.0,Unsuccessful,,,,
3,7/1/2021,Core Enchantment Zone,4,4,6/9/2021,Snow Zone,4.0,4.0,7/7/2021,Colchuck Zone,4.0,4.0,Unsuccessful,,,,
4,6/21/2021,Colchuck Zone,2,2,6/28/2021,Colchuck Zone,2.0,2.0,7/13/2021,Stuart Zone,2.0,2.0,Unsuccessful,,,,


In [31]:
# Check column data types
raw_df.dtypes

Preferred Entry Date 1              object
Preferred Division 1                object
Minimum Acceptable Group Size 1      int64
Maximum Requested Group Size 1       int64
Preferred Entry Date 2              object
Preferred Division 2                object
Minimum Acceptable Group Size 2    float64
Maximum Requested Group Size 2     float64
Preferred Entry Date 3              object
Preferred Division 3                object
Minimum Acceptable Group Size 3    float64
Maximum Requested Group Size 3     float64
Results Status                      object
Awarded Preference                 float64
Awarded Entry Date                  object
Awarded Entrance Code/Name          object
Awarded Group Size                 float64
dtype: object

In [32]:
# Identify columns with date data
date_columns = [
    "Preferred Entry Date 1",
    "Preferred Entry Date 2",
    "Preferred Entry Date 3",
    "Awarded Entry Date",
]

# Convert date columns to datetime
for col in date_columns:
    raw_df[col] = pd.to_datetime(raw_df[col])

# Check column data types
raw_df.dtypes

Preferred Entry Date 1             datetime64[ns]
Preferred Division 1                       object
Minimum Acceptable Group Size 1             int64
Maximum Requested Group Size 1              int64
Preferred Entry Date 2             datetime64[ns]
Preferred Division 2                       object
Minimum Acceptable Group Size 2           float64
Maximum Requested Group Size 2            float64
Preferred Entry Date 3             datetime64[ns]
Preferred Division 3                       object
Minimum Acceptable Group Size 3           float64
Maximum Requested Group Size 3            float64
Results Status                             object
Awarded Preference                        float64
Awarded Entry Date                 datetime64[ns]
Awarded Entrance Code/Name                 object
Awarded Group Size                        float64
dtype: object

In [33]:
# Number columns to convert NaN values to 0
number_columns = [
    "Minimum Acceptable Group Size 2",
    "Maximum Requested Group Size 2",
    "Minimum Acceptable Group Size 3",
    "Maximum Requested Group Size 3",
    "Awarded Preference",
    "Awarded Group Size",
]

# Convert NaN values to 0
for col in number_columns:
    raw_df[col] = raw_df[col].fillna(0)

# Convert float to int
for col in raw_df.columns:
    if raw_df[col].dtype == "float64":
        raw_df[col] = raw_df[col].astype("int64")

# Check column data types
raw_df.dtypes

Preferred Entry Date 1             datetime64[ns]
Preferred Division 1                       object
Minimum Acceptable Group Size 1             int64
Maximum Requested Group Size 1              int64
Preferred Entry Date 2             datetime64[ns]
Preferred Division 2                       object
Minimum Acceptable Group Size 2             int64
Maximum Requested Group Size 2              int64
Preferred Entry Date 3             datetime64[ns]
Preferred Division 3                       object
Minimum Acceptable Group Size 3             int64
Maximum Requested Group Size 3              int64
Results Status                             object
Awarded Preference                          int64
Awarded Entry Date                 datetime64[ns]
Awarded Entrance Code/Name                 object
Awarded Group Size                          int64
dtype: object

In [34]:
# Fill NaN values in string columns and convert to string
columns_to_convert = [
    "Preferred Division 1",
    "Preferred Division 2",
    "Preferred Division 3",
    "Results Status",
    "Awarded Entrance Code/Name",
]
for col in columns_to_convert:
    # Converting to string may be unnecessary here
    raw_df[col] = raw_df[col].fillna("N/A").astype(str)

# Check column data types
raw_df.dtypes

Preferred Entry Date 1             datetime64[ns]
Preferred Division 1                       object
Minimum Acceptable Group Size 1             int64
Maximum Requested Group Size 1              int64
Preferred Entry Date 2             datetime64[ns]
Preferred Division 2                       object
Minimum Acceptable Group Size 2             int64
Maximum Requested Group Size 2              int64
Preferred Entry Date 3             datetime64[ns]
Preferred Division 3                       object
Minimum Acceptable Group Size 3             int64
Maximum Requested Group Size 3              int64
Results Status                             object
Awarded Preference                          int64
Awarded Entry Date                 datetime64[ns]
Awarded Entrance Code/Name                 object
Awarded Group Size                          int64
dtype: object

In [35]:
# Check for NaN values
raw_df.isna().sum()

Preferred Entry Date 1                 0
Preferred Division 1                   0
Minimum Acceptable Group Size 1        0
Maximum Requested Group Size 1         0
Preferred Entry Date 2               422
Preferred Division 2                   0
Minimum Acceptable Group Size 2        0
Maximum Requested Group Size 2         0
Preferred Entry Date 3              1021
Preferred Division 3                   0
Minimum Acceptable Group Size 3        0
Maximum Requested Group Size 3         0
Results Status                         0
Awarded Preference                     0
Awarded Entry Date                 34250
Awarded Entrance Code/Name             0
Awarded Group Size                     0
dtype: int64

In [36]:
# Convert NaN values in date columns to 0
# This feels like an odd approach, but I want to maintain the date data type.
# The analyst will need to understand that zero epoch dates are actually NaN values.
for col in date_columns:  # Date columns defined in previous cell
    raw_df[col] = raw_df[col].fillna(pd.Timestamp(0))

In [37]:
# Check for NaN values
raw_df.isna().sum()

Preferred Entry Date 1             0
Preferred Division 1               0
Minimum Acceptable Group Size 1    0
Maximum Requested Group Size 1     0
Preferred Entry Date 2             0
Preferred Division 2               0
Minimum Acceptable Group Size 2    0
Maximum Requested Group Size 2     0
Preferred Entry Date 3             0
Preferred Division 3               0
Minimum Acceptable Group Size 3    0
Maximum Requested Group Size 3     0
Results Status                     0
Awarded Preference                 0
Awarded Entry Date                 0
Awarded Entrance Code/Name         0
Awarded Group Size                 0
dtype: int64

In [38]:
# Check data types
raw_df.dtypes

Preferred Entry Date 1             datetime64[ns]
Preferred Division 1                       object
Minimum Acceptable Group Size 1             int64
Maximum Requested Group Size 1              int64
Preferred Entry Date 2             datetime64[ns]
Preferred Division 2                       object
Minimum Acceptable Group Size 2             int64
Maximum Requested Group Size 2              int64
Preferred Entry Date 3             datetime64[ns]
Preferred Division 3                       object
Minimum Acceptable Group Size 3             int64
Maximum Requested Group Size 3              int64
Results Status                             object
Awarded Preference                          int64
Awarded Entry Date                 datetime64[ns]
Awarded Entrance Code/Name                 object
Awarded Group Size                          int64
dtype: object

In [39]:
# Check values for each column
for col in raw_df.columns:
    print(f"{col}: {raw_df[col].unique()}\n\n\n")

Preferred Entry Date 1: <DatetimeArray>
['2021-07-25 00:00:00', '2021-08-12 00:00:00', '2021-07-30 00:00:00',
 '2021-07-01 00:00:00', '2021-06-21 00:00:00', '2021-08-06 00:00:00',
 '2021-08-10 00:00:00', '2021-07-08 00:00:00', '2021-08-31 00:00:00',
 '2021-08-19 00:00:00',
 ...
 '2021-10-23 00:00:00', '2021-10-18 00:00:00', '2021-05-25 00:00:00',
 '2021-10-29 00:00:00', '2021-10-21 00:00:00', '2021-10-27 00:00:00',
 '2021-10-30 00:00:00', '2021-10-26 00:00:00', '2021-10-19 00:00:00',
 '2021-10-24 00:00:00']
Length: 167, dtype: datetime64[ns]



Preferred Division 1: ['Core Enchantment Zone' 'Colchuck Zone' 'Snow Zone' 'Stuart  Zone'
 'Eightmile/Caroline Zone' 'Eightmile/Caroline Zone (stock)'
 'Stuart Zone (stock)']



Minimum Acceptable Group Size 1: [7 8 4 2 5 6 3 1]



Maximum Requested Group Size 1: [7 8 4 2 5 6 3 1]



Preferred Entry Date 2: <DatetimeArray>
['2021-08-01 00:00:00', '2021-08-12 00:00:00', '2021-08-14 00:00:00',
 '2021-06-09 00:00:00', '2021-06-28 00:00:00', '2021-0

In [40]:
# Examine the first 20 rows
raw_df.head(20)

Unnamed: 0,Preferred Entry Date 1,Preferred Division 1,Minimum Acceptable Group Size 1,Maximum Requested Group Size 1,Preferred Entry Date 2,Preferred Division 2,Minimum Acceptable Group Size 2,Maximum Requested Group Size 2,Preferred Entry Date 3,Preferred Division 3,Minimum Acceptable Group Size 3,Maximum Requested Group Size 3,Results Status,Awarded Preference,Awarded Entry Date,Awarded Entrance Code/Name,Awarded Group Size
0,2021-07-25,Core Enchantment Zone,7,7,2021-08-01,Core Enchantment Zone,7,7,2021-07-25,Snow Zone,8,8,Unsuccessful,0,1970-01-01,,0
1,2021-08-12,Core Enchantment Zone,8,8,2021-08-12,Colchuck Zone,8,8,2021-08-12,Eightmile/Caroline Zone,8,8,Unsuccessful,0,1970-01-01,,0
2,2021-07-30,Core Enchantment Zone,4,4,2021-08-14,Core Enchantment Zone,4,4,2021-07-16,Core Enchantment Zone,7,7,Unsuccessful,0,1970-01-01,,0
3,2021-07-01,Core Enchantment Zone,4,4,2021-06-09,Snow Zone,4,4,2021-07-07,Colchuck Zone,4,4,Unsuccessful,0,1970-01-01,,0
4,2021-06-21,Colchuck Zone,2,2,2021-06-28,Colchuck Zone,2,2,2021-07-13,Stuart Zone,2,2,Unsuccessful,0,1970-01-01,,0
5,2021-08-06,Core Enchantment Zone,4,4,2021-08-13,Core Enchantment Zone,4,4,2021-09-17,Core Enchantment Zone,4,4,Unsuccessful,0,1970-01-01,,0
6,2021-08-10,Core Enchantment Zone,8,8,2021-08-11,Core Enchantment Zone,8,8,2021-08-17,Colchuck Zone,8,8,Unsuccessful,0,1970-01-01,,0
7,2021-07-08,Core Enchantment Zone,2,2,2021-08-05,Core Enchantment Zone,2,2,2021-08-12,Core Enchantment Zone,2,2,Unsuccessful,0,1970-01-01,,0
8,2021-08-31,Colchuck Zone,2,2,2021-09-01,Colchuck Zone,2,2,2021-09-02,Colchuck Zone,2,2,Unsuccessful,0,1970-01-01,,0
9,2021-08-19,Colchuck Zone,4,4,2021-08-19,Core Enchantment Zone,4,4,2021-08-26,Colchuck Zone,4,4,Unsuccessful,0,1970-01-01,,0


In [41]:
# The group size columns have the same values for the minimum and maximum group size
# This is redundant and we can drop the maximum group size columns
print(
    (
        raw_df["Maximum Requested Group Size 1"]
        - raw_df["Minimum Acceptable Group Size 1"]
    ).unique()
)
print(
    (
        raw_df["Maximum Requested Group Size 2"]
        - raw_df["Minimum Acceptable Group Size 2"]
    ).unique()
)
print(
    (
        raw_df["Maximum Requested Group Size 3"]
        - raw_df["Minimum Acceptable Group Size 3"]
    ).unique()
)

[0]
[0]
[0]


In [42]:
# Drop the maximum group size columns because there is no variation from the minimum group size columns
raw_df = raw_df.drop(
    columns=[
        "Maximum Requested Group Size 1",
        "Maximum Requested Group Size 2",
        "Maximum Requested Group Size 3",
    ]
)

# Check the data
raw_df.head(20)

Unnamed: 0,Preferred Entry Date 1,Preferred Division 1,Minimum Acceptable Group Size 1,Preferred Entry Date 2,Preferred Division 2,Minimum Acceptable Group Size 2,Preferred Entry Date 3,Preferred Division 3,Minimum Acceptable Group Size 3,Results Status,Awarded Preference,Awarded Entry Date,Awarded Entrance Code/Name,Awarded Group Size
0,2021-07-25,Core Enchantment Zone,7,2021-08-01,Core Enchantment Zone,7,2021-07-25,Snow Zone,8,Unsuccessful,0,1970-01-01,,0
1,2021-08-12,Core Enchantment Zone,8,2021-08-12,Colchuck Zone,8,2021-08-12,Eightmile/Caroline Zone,8,Unsuccessful,0,1970-01-01,,0
2,2021-07-30,Core Enchantment Zone,4,2021-08-14,Core Enchantment Zone,4,2021-07-16,Core Enchantment Zone,7,Unsuccessful,0,1970-01-01,,0
3,2021-07-01,Core Enchantment Zone,4,2021-06-09,Snow Zone,4,2021-07-07,Colchuck Zone,4,Unsuccessful,0,1970-01-01,,0
4,2021-06-21,Colchuck Zone,2,2021-06-28,Colchuck Zone,2,2021-07-13,Stuart Zone,2,Unsuccessful,0,1970-01-01,,0
5,2021-08-06,Core Enchantment Zone,4,2021-08-13,Core Enchantment Zone,4,2021-09-17,Core Enchantment Zone,4,Unsuccessful,0,1970-01-01,,0
6,2021-08-10,Core Enchantment Zone,8,2021-08-11,Core Enchantment Zone,8,2021-08-17,Colchuck Zone,8,Unsuccessful,0,1970-01-01,,0
7,2021-07-08,Core Enchantment Zone,2,2021-08-05,Core Enchantment Zone,2,2021-08-12,Core Enchantment Zone,2,Unsuccessful,0,1970-01-01,,0
8,2021-08-31,Colchuck Zone,2,2021-09-01,Colchuck Zone,2,2021-09-02,Colchuck Zone,2,Unsuccessful,0,1970-01-01,,0
9,2021-08-19,Colchuck Zone,4,2021-08-19,Core Enchantment Zone,4,2021-08-26,Colchuck Zone,4,Unsuccessful,0,1970-01-01,,0


In [43]:
raw_df.columns

Index(['Preferred Entry Date 1', 'Preferred Division 1',
       'Minimum Acceptable Group Size 1', 'Preferred Entry Date 2',
       'Preferred Division 2', 'Minimum Acceptable Group Size 2',
       'Preferred Entry Date 3', 'Preferred Division 3',
       'Minimum Acceptable Group Size 3', 'Results Status',
       'Awarded Preference', 'Awarded Entry Date',
       'Awarded Entrance Code/Name', 'Awarded Group Size'],
      dtype='object')

In [44]:
# Change column names to lowercase, with underscores for spaces and slashes
raw_df.columns = [
    col.lower().replace(" ", "_").replace("/", "_") for col in raw_df.columns
]

# Check the names
raw_df.columns

Index(['preferred_entry_date_1', 'preferred_division_1',
       'minimum_acceptable_group_size_1', 'preferred_entry_date_2',
       'preferred_division_2', 'minimum_acceptable_group_size_2',
       'preferred_entry_date_3', 'preferred_division_3',
       'minimum_acceptable_group_size_3', 'results_status',
       'awarded_preference', 'awarded_entry_date',
       'awarded_entrance_code_name', 'awarded_group_size'],
      dtype='object')

In [45]:
# Check the data
raw_df.head(20)

Unnamed: 0,preferred_entry_date_1,preferred_division_1,minimum_acceptable_group_size_1,preferred_entry_date_2,preferred_division_2,minimum_acceptable_group_size_2,preferred_entry_date_3,preferred_division_3,minimum_acceptable_group_size_3,results_status,awarded_preference,awarded_entry_date,awarded_entrance_code_name,awarded_group_size
0,2021-07-25,Core Enchantment Zone,7,2021-08-01,Core Enchantment Zone,7,2021-07-25,Snow Zone,8,Unsuccessful,0,1970-01-01,,0
1,2021-08-12,Core Enchantment Zone,8,2021-08-12,Colchuck Zone,8,2021-08-12,Eightmile/Caroline Zone,8,Unsuccessful,0,1970-01-01,,0
2,2021-07-30,Core Enchantment Zone,4,2021-08-14,Core Enchantment Zone,4,2021-07-16,Core Enchantment Zone,7,Unsuccessful,0,1970-01-01,,0
3,2021-07-01,Core Enchantment Zone,4,2021-06-09,Snow Zone,4,2021-07-07,Colchuck Zone,4,Unsuccessful,0,1970-01-01,,0
4,2021-06-21,Colchuck Zone,2,2021-06-28,Colchuck Zone,2,2021-07-13,Stuart Zone,2,Unsuccessful,0,1970-01-01,,0
5,2021-08-06,Core Enchantment Zone,4,2021-08-13,Core Enchantment Zone,4,2021-09-17,Core Enchantment Zone,4,Unsuccessful,0,1970-01-01,,0
6,2021-08-10,Core Enchantment Zone,8,2021-08-11,Core Enchantment Zone,8,2021-08-17,Colchuck Zone,8,Unsuccessful,0,1970-01-01,,0
7,2021-07-08,Core Enchantment Zone,2,2021-08-05,Core Enchantment Zone,2,2021-08-12,Core Enchantment Zone,2,Unsuccessful,0,1970-01-01,,0
8,2021-08-31,Colchuck Zone,2,2021-09-01,Colchuck Zone,2,2021-09-02,Colchuck Zone,2,Unsuccessful,0,1970-01-01,,0
9,2021-08-19,Colchuck Zone,4,2021-08-19,Core Enchantment Zone,4,2021-08-26,Colchuck Zone,4,Unsuccessful,0,1970-01-01,,0


In [46]:
# Export cleaned data to csv
raw_df.to_csv("./2021_results_cleaned.csv", index=False, date_format="%m-%d-%Y")

In [None]:
# Check import of cleaned data
cleaned_df = pd.read_csv(
    "./2021_results_cleaned.csv",
    # Import was failing to parse date columns, so I
    # had to pass in the column names
    parse_dates=[
        "preferred_entry_date_1",
        "preferred_entry_date_2",
        "preferred_entry_date_3",
        "awarded_entry_date",
    ],
    date_format="%m-%d-%Y",  # Align format with export format
    na_filter=False,  # Do not convert 'N/A' to NaN
)

# Check the datatypes
cleaned_df.dtypes

preferred_entry_date_1             datetime64[ns]
preferred_division_1                       object
minimum_acceptable_group_size_1             int64
preferred_entry_date_2             datetime64[ns]
preferred_division_2                       object
minimum_acceptable_group_size_2             int64
preferred_entry_date_3             datetime64[ns]
preferred_division_3                       object
minimum_acceptable_group_size_3             int64
results_status                             object
awarded_preference                          int64
awarded_entry_date                 datetime64[ns]
awarded_entrance_code_name                 object
awarded_group_size                          int64
dtype: object

In [48]:
# Check cleaned dataframe for NaN values
cleaned_df.isna().sum()

preferred_entry_date_1             0
preferred_division_1               0
minimum_acceptable_group_size_1    0
preferred_entry_date_2             0
preferred_division_2               0
minimum_acceptable_group_size_2    0
preferred_entry_date_3             0
preferred_division_3               0
minimum_acceptable_group_size_3    0
results_status                     0
awarded_preference                 0
awarded_entry_date                 0
awarded_entrance_code_name         0
awarded_group_size                 0
dtype: int64