# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is one part of the Data Investigation Process that can include data cleaning, wrangling, and visualization. The "Data Moves" framework:

- Provides a structured set of categories (i.e., data moves) to describe and analyze how students engage with data

- Supports instructional design and assessment by offering a lens through which educators can identify, understand, and demonstrate data practices

Before exploring data, it is important to select datasets that are appropriate for students based on grade level, subject-area relevance, size, and number of features (i.e., variables). This notebook covers basic considerations for selecting datasets suitable for exploratory data analysis in an introductory data science course. It also provides examples of the core data moves, along with explanations of the output generated by each move (e.g., a value, table, or visualization).

## Selecting a Dataset

### Tidy Data

The 2014 paper *Tidy Data* presents a structured framework for organizing datasets to support efficient analysis. In a tidy dataset, each variable forms a column, each observation forms a row, and each type of observational unit is stored in a separate table. It also outlines strategies for transforming messy data into tidy form, demonstrating how this approach simplifies and strengthens data analysis practices.

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10
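
As a minimal sketch of the tidy-data idea, the cell below reshapes a small, made-up table in which the column headers (`quiz1`, `quiz2`) are really values of a variable. The table, its column names, and the scores are assumptions for illustration only; they do not come from the paper.

```python
import pandas as pd

# Hypothetical "messy" table: one row per student, one column per quiz.
# The headers quiz1 and quiz2 are values of a "quiz" variable, not variables.
messy = pd.DataFrame({
    "student": ["Ana", "Ben"],
    "quiz1": [8, 6],
    "quiz2": [9, 7],
})

# melt() reshapes wide to long so that each row is one observation
# (one student on one quiz) and each variable has its own column.
tidy = messy.melt(id_vars="student", var_name="quiz", value_name="score")
print(tidy)
```

In the tidy version, questions such as "what is the average score per quiz?" become a single grouping step rather than a column-by-column calculation.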

### Tame Data

The 2018 paper *The fivethirtyeight R Package* introduces the concept of tame data: datasets that are clean, well labeled, and easy to use in teaching. Tame data minimizes the need for wrangling so students can focus on analysis. The paper highlights the importance of using structured and accessible data in introductory statistics and data science courses.

Kim, A. Y., Ismay, C., & Chunn, J. (2018). The fivethirtyeight R Package: “Tame Data” Principles for Introductory Statistics and Data Science Courses. Technology Innovations in Statistics Education, 11(1). https://doi.org/10.5070/T5111035892

## Investigating Data Like a Data Scientist

Investigating data like a data scientist involves an iterative process of making sense of information. This process includes six key phases: 

- Framing the problem
- Exploring and visualizing data
- Modeling
- Evaluating results
- Crafting a narrative
- Communicating findings

These phases reflect authentic data science practice and provide a structure that supports more meaningful engagement with data in educational settings.

Rather than following a fixed procedure, this framework emphasizes the importance of habits of mind such as critical thinking, refining questions, and considering the audience. It highlights that data science relies not only on technical skills but also on decision-making, interpretation, and storytelling. When students are guided through these phases, they develop the analytical reasoning needed to navigate data and communicate insights.

## Data Investigation Process Framework


### Frame Problem

- Consider real-world phenomena and broader issues related to the problem
- Pose investigative question(s)
- Anticipate potential data and strategies

### Consider & Gather Data

- Understand possible attributes, measurements, and data collection methods needed for the problem
- Evaluate and use appropriate design and techniques to collect or source data
- Consider sample size, access, storage, and trustworthiness of data

### Process Data

- Organize, structure, clean, and transform data in efficient and useful ways
- Consider additional data cases or attributes

### Explore & Visualize Data

- Construct meaningful visualizations, static or dynamic
- Compute meaningful statistical measures
- Explore and analyze data for potential relationships or patterns that address the problem

### Consider Models

- Analyze and identify models that address the problem
- Consider assumptions and context of the models
- Recognize possible limitations

### Communicate & Propose Action

- Craft a data story to convey insight to stakeholder audiences
- Justify claims with evidence from data and propose possible action
- Address uncertainty, constraints, and potential bias in the analysis

Lee, H. S., Mojica, G. F., Thrasher, E. P., & Baumgartner, P. (2022). Investigating data like a data scientist: Key practices and processes. *Statistics Education Research Journal, 21*(2). [https://doi.org/10.52041/serj.v21i2.41](https://doi.org/10.52041/serj.v21i2.41)

# Data Moves

Data moves are strategic actions taken during data analysis to reshape and prepare datasets for interpretation. These include filtering, grouping, summarizing, calculating, merging or joining data, and creating hierarchical structures. Each move alters the structure, content, or values of a dataset, influencing what patterns become visible and what questions can be explored. By understanding these moves, learners gain insight into how data analysis is an active, decision-driven process rather than a passive application of procedures.

Erickson, T., Wilkerson, M., Finzer, W., & Reichsman, F. (2019). Data moves. Technology Innovations in Statistics Education, 12(1). https://escholarship.org/uc/item/0mg8m7g6

## Grouping

Grouping is used to set up a comparison among different subgroups of a dataset. Just as filtering restricts analysis to a single subset, grouping divides a dataset into multiple subsets. This division is guided by the available value(s) of some attribute or attributes so that, among the cases within each resulting group, the values of these "grouping" attributes are the same.

**Note:** Grouping and summarizing are often used together to simplify complex datasets by reducing them to fewer data points that highlight overall patterns. However, this simplification can result in a loss of detail, such as variability.
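
The contrast between filtering and grouping, and the loss of detail mentioned in the note, can be sketched with a tiny made-up table. The `pets` data and its column names are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data (not from the Star Wars dataset)
pets = pd.DataFrame({
    "kind": ["cat", "cat", "dog", "dog"],
    "weight_kg": [4.0, 6.0, 5.0, 25.0],
})

# Filtering keeps a single subset of the cases
cats_only = pets[pets["kind"] == "cat"]

# Grouping splits all cases into subsets that share a value of 'kind';
# summarizing then reduces each subset to one number
mean_weight = pets.groupby("kind")["weight_kg"].mean()

# The means alone hide how spread out each group is;
# the standard deviation brings back some of that detail
sd_weight = pets.groupby("kind")["weight_kg"].std()

print(mean_weight)
print(sd_weight)
```

Here the two dogs differ far more in weight than the two cats, a difference that the group means by themselves do not show.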

# Data Dive

A data dive is a focused exploratory analysis where students work closely with a dataset to uncover patterns, trends, and relationships by applying key data moves. For example, using a dataset about school lunch nutrition, students might begin by filtering to isolate meals served in a specific year or location. They could group the data by food category such as fruits, grains, or proteins to explore how nutritional content differs across types. Through summarizing, they might calculate the average number of calories or the typical sodium content for each group. Calculating might involve creating new variables, such as calories per gram or the percentage of a recommended daily intake. If additional data sources are provided, students could join datasets, such as connecting lunch menus with student demographic information, to add context and depth to their findings. These data moves help students make sense of multivariable datasets and support evidence-based insights.
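
The sequence of moves described above might look like the following sketch in pandas. Every table, column name, and value here is invented for illustration; a real school lunch dataset would have its own structure.

```python
import pandas as pd

# Hypothetical school lunch data; all names and values are invented
lunch = pd.DataFrame({
    "item": ["apple", "roll", "chicken", "pear", "rice"],
    "category": ["fruit", "grain", "protein", "fruit", "grain"],
    "year": [2023, 2023, 2023, 2022, 2023],
    "calories": [95, 150, 220, 100, 200],
    "grams": [180, 50, 110, 180, 160],
})

# Hypothetical second table, used only to illustrate a join
daily_targets = pd.DataFrame({
    "category": ["fruit", "grain", "protein"],
    "target_calories": [200, 400, 300],
})

# Filter: keep only items served in 2023
lunch_2023 = lunch[lunch["year"] == 2023].copy()

# Calculate: create a new variable from existing ones
lunch_2023["cal_per_gram"] = lunch_2023["calories"] / lunch_2023["grams"]

# Group and summarize: average calories for each food category
avg_calories = lunch_2023.groupby("category")["calories"].mean()

# Join: attach the category-level targets to each item
joined = lunch_2023.merge(daily_targets, on="category", how="left")

print(avg_calories)
```

Each step produces a different kind of output: a smaller table, a table with an extra column, a short Series of summaries, and a wider table.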

## Analysis with Data Moves in Python

In this section, we present example use cases that demonstrate data moves using the Python programming language. While Python includes built-in tools and data structures for general data handling, it does not include a built-in data structure specifically designed for working with tidy data as defined by the _Tidy Data_ paper. To support the tidy data format and organize the analysis around data moves such as filtering, grouping, and summarizing, we will use the pandas library for data wrangling, NumPy for scientific computing, and Matplotlib for creating visualizations.

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Updated Starwars Dataset

Fabricio Narcizo’s blog post, Introduction to Data Analysis using the Star Wars Dataset, presents an expanded version of the original R dplyr Star Wars dataset, growing it from 14 to 25 variables. By cross-referencing Wookieepedia, he corrected and enriched the character data with new fields like birth/death details, pronouns, occupations, and abilities, resulting in a more accurate and comprehensive dataset for analysis.

#### Updated Starwars Datasheet

[Updated Starwars Datasheet](https://docs.google.com/document/d/1Gr6W0xo1pxW-TZH2GKawIASqZ7JfURmzQO1Eg2JGsC8/edit?usp=sharing)

Narcizo, F. B. (2023, December 30). Introduction to Data Analysis using the Star Wars Dataset. Retrieved from https://www.fabricionarcizo.com/post/starwars/

**Example 1.** Load the dataset into a pandas `DataFrame` called `starwars`.


In [None]:
starwars = pd.read_csv("data/updated_starwars.csv")
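
After loading a file, it often helps to check its size and how many values are missing in each column. The sketch below uses a small in-memory CSV as a stand-in, since the real file's contents are not shown here; the three rows and the column names are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file; the last row
# has empty 'species' and 'mass' fields
csv_text = "name,species,mass\nLuke,Human,77\nR2-D2,Droid,32\nMystery,,\n"
demo = pd.read_csv(io.StringIO(csv_text))

# (number of rows, number of columns)
print(demo.shape)

# read_csv stores empty fields as NaN; isna().sum() counts them per column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

The same two checks, `starwars.shape` and `starwars.isna().sum()`, can be run on the `starwars` DataFrame.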

### Grouping

`group_species` is a user-defined function that classifies a species value as `"Human"`, `"Droid"`, or `"Other"`. It takes a single input, `s`, which must be a string or `None`. If the input is `"Human"` or `"Droid"`, the function returns that same value. If the input is any other species name or a missing value, it returns `"Other"`.

In [None]:
def group_species(s):
    """
    Classifies a species value as 'Human', 'Droid', or 'Other'.
    
    Examples:
    
    group_species("Human")  returns "Human"
    group_species("Rodian") returns "Other"
    group_species("Droid")  returns "Droid"

    Parameters:
    s: A single species value from the dataset.

    Returns:
    str: Returns 'Human' or 'Droid' if the input matches those values. 
         Otherwise returns 'Other'.
         
    Precondition: s is a string or None
    """
    if s in ["Human", "Droid"]:
        return s
    else:
        return "Other"

In [None]:
# .apply() runs the group_species function on each value in the 'species' column
# It goes through the column one row at a time
# This returns a Series of values of 'Human', 'Droid', or 'Other'
starwars["species_grps"] = starwars["species"].apply(group_species)
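
As an alternative to `.apply()`, the same classification can be sketched with the column-wise methods `.isin()` and `.where()`. The short Series below is a hypothetical stand-in for the `species` column, and this is one possible approach rather than a required one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'species' column, including a missing value
species = pd.Series(["Human", "Droid", "Rodian", np.nan, "Human"])

# isin() marks which values are 'Human' or 'Droid' (missing values are False);
# where() keeps those values and replaces everything else with 'Other'
species_grps = species.where(species.isin(["Human", "Droid"]), "Other")

print(species_grps.tolist())
```

Both approaches give the same labels. The `.apply()` version makes the rule easy to read as a function, while the column-wise version avoids writing a separate function.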

In [None]:
# The .groupby() method is used to group the DataFrame by the 
# values in 'species_grps' column
# It creates a GroupBy object that allows you to perform 
# operations separately for each group
starwars.groupby("species_grps")

In [None]:
# Uses the .groupby() method to group the DataFrame by the 'species_grps' column
# Stores the resulting GroupBy object in the variable 'grps'
grps = starwars.groupby("species_grps")

# Accesses the dictionary of group labels (keys) from the GroupBy object
# This returns a view object with the unique values 
# in 'species_grps' that were used to form the groups
grps.groups.keys()

In [None]:
# Accesses the .groups attribute of the GroupBy object
# This returns a dictionary where the keys are group labels (e.g., 'Human', 'Droid', 'Other')
# and the values are pandas Index objects holding the row labels that belong to each group
grps.groups
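
To look at the rows inside a group rather than only their labels, a `GroupBy` object can be queried with `.get_group()` or looped over. The sketch below uses a small made-up table so that it runs on its own; the names `toy` and `toy_grps` are hypothetical.

```python
import pandas as pd

# Hypothetical toy table with an already-computed grouping column
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "species_grps": ["Human", "Droid", "Human", "Other"],
})
toy_grps = toy.groupby("species_grps")

# get_group() returns the rows of one group as a DataFrame
humans = toy_grps.get_group("Human")

# Iterating over a GroupBy yields (label, sub-DataFrame) pairs
group_sizes = {label: len(rows) for label, rows in toy_grps}

print(humans)
print(group_sizes)
```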

**Example 2.** Use the `grps` `GroupBy` object to calculate the average mass for each species group.

In [None]:
# Selects the 'mass' column from the GroupBy object 'grps'
# Uses the .mean() method to calculate the average mass for each group in 'species_grps'
# Returns a Series showing the mean mass for Humans, Droids, and Other species
grps["mass"].mean()

**Output Observations:**
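
Because a mean alone hides how many characters are in each group and how much they vary, it can help to compute several summaries at once. The following is a sketch on made-up values, using `.agg()` with a list of summary names; the `toy` table is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one mass is missing
toy = pd.DataFrame({
    "species_grps": ["Human", "Human", "Droid", "Droid", "Other"],
    "mass": [80.0, 60.0, 30.0, np.nan, 100.0],
})

# agg() with a list returns one column per summary statistic.
# 'count' reports only the non-missing values in each group.
summary = toy.groupby("species_grps")["mass"].agg(["count", "mean", "median", "std"])

print(summary)
```

Groups with a single non-missing value show `NaN` for the standard deviation, which is itself a useful reminder of how little data sits behind that group's mean.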

**Example 3.** Use the `grps` `GroupBy` object to count how many times each homeworld appears within each species group.

In [None]:
# Selects the 'homeworld' column from the GroupBy object 'grps'
# Applies .value_counts() to count how many times each homeworld appears within each species group
# Returns a Series with counts of homeworlds for 'Human', 'Droid', and 'Other' groups
grps["homeworld"].value_counts()
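
The long Series returned by `.value_counts()` can be hard to scan. One possible alternative, sketched here on made-up rows, is `pd.crosstab()`, which arranges the same kind of counts as a two-way table. The `toy` table is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "species_grps": ["Human", "Human", "Human", "Droid", "Other"],
    "homeworld": ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Naboo"],
})

# One row per species group, one column per homeworld;
# combinations that never occur are shown as 0
counts_table = pd.crosstab(toy["species_grps"], toy["homeworld"])

print(counts_table)
```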

**Example 4.** Create a box plot to compare the distribution of character heights across the three species groups: `"Human"`, `"Droid"`, and `"Other"`. 

In [None]:
# Creates a box plot of the 'height' column grouped by 'species_grps'
# This shows the median, quartiles, and possible outliers for each group
starwars.boxplot(column="height", by="species_grps")

# Turns off the background grid
plt.grid(False);

**Output Observations:** 
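
The numbers that a box plot draws (minimum, quartiles, median, maximum) can also be computed directly, which may help when writing observations. This is a sketch on made-up heights using `.describe()`; the `toy` table is hypothetical.

```python
import pandas as pd

# Hypothetical toy heights in centimeters
toy = pd.DataFrame({
    "species_grps": ["Human"] * 4 + ["Droid"] * 2,
    "height": [160.0, 170.0, 180.0, 190.0, 96.0, 167.0],
})

# describe() returns count, mean, std, min, quartiles, and max for each group
height_summary = toy.groupby("species_grps")["height"].describe()

print(height_summary)
```

The `50%` column is the median, which corresponds to the line inside each box.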