# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is one part of the Data Investigation Process that can include data cleaning, wrangling, and visualization. The "Data Moves" framework:

- Provides a structured set of categories (i.e., data moves) to describe and analyze how students engage with data

- Supports instructional design and assessment by offering a lens through which educators can identify, understand, and demonstrate data practices

Before exploring data, it is important to select datasets that are appropriate for students based on grade level, subject-area relevance, size, and number of features (i.e., variables). This notebook covers basic considerations for selecting datasets suitable for exploratory data analysis in an introductory data science course. It also provides examples of the core data moves, along with explanations of the output generated by each move (e.g., a value, table, or visualization).

## Selecting a Dataset

### Tidy Data

The 2014 paper *Tidy Data* presents a structured framework for organizing datasets to support efficient analysis. In a tidy dataset, each variable forms a column, each observation forms a row, and each type of observational unit is stored in a separate table. It also outlines strategies for transforming messy data into tidy form, demonstrating how this approach simplifies and strengthens data analysis practices.

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10
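
As a minimal sketch of what "transforming messy data into tidy form" can look like in pandas, the example below reshapes a small table with `melt`. The table, its column names (`student`, `quiz_1`, `quiz_2`), and its values are hypothetical and are not taken from the paper.

```python
import pandas as pd

# Hypothetical "messy" table: one row per student, one column per quiz.
# The quiz identity is hidden in the column names instead of being a variable.
messy = pd.DataFrame({
    "student": ["Ana", "Ben"],
    "quiz_1": [8, 6],
    "quiz_2": [9, 7],
})

# Reshape so that each row is one observation (a student's score on one quiz)
# and each variable (student, quiz, score) has its own column.
tidy = messy.melt(id_vars="student", var_name="quiz", value_name="score")

print(tidy)
```

In the tidy version, grouping by `quiz` or by `student` works the same way, which is harder to do when the quizzes are spread across columns.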

### Tame Data

The 2018 paper *The fivethirtyeight R Package* introduces the concept of tame data, which refers to datasets that are clean, well-labeled, and easy to use in teaching. Tame data minimizes the need for wrangling so students can focus on analysis. The paper highlights the importance of using structured and accessible data in introductory statistics and data science courses.

Kim, A. Y., Ismay, C., & Chunn, J. (2018). The fivethirtyeight R Package: “Tame Data” Principles for Introductory Statistics and Data Science Courses. Technology Innovations in Statistics Education, 11(1). https://doi.org/10.5070/T5111035892

## Investigating Data Like a Data Scientist

Investigating data like a data scientist involves an iterative process of making sense of information. This process includes six key phases: 

- Framing the problem
- Exploring and visualizing data
- Modeling
- Evaluating results
- Crafting a narrative
- Communicating findings

These phases reflect authentic data science practice and provide a structure that supports more meaningful engagement with data in educational settings.

Rather than following a fixed procedure, this framework emphasizes the importance of habits of mind such as critical thinking, refining questions, and considering the audience. It highlights that data science relies not only on technical skills but also on decision-making, interpretation, and storytelling. When students are guided through these phases, they develop the analytical reasoning needed to navigate data and communicate insights.

## Data Investigation Process Framework


### Frame Problem

- Consider real-world phenomena and broader issues related to the problem
- Pose investigative question(s)
- Anticipate potential data and strategies

### Consider & Gather Data

- Understand possible attributes, measurements, and data collection methods needed for the problem
- Evaluate and use appropriate design and techniques to collect or source data
- Consider sample size, access, storage, and trustworthiness of data

### Process Data

- Organize, structure, clean, and transform data in efficient and useful ways
- Consider additional data cases or attributes

### Explore & Visualize Data

- Construct meaningful visualizations, static or dynamic
- Compute meaningful statistical measures
- Explore and analyze data for potential relationships or patterns that address the problem

### Consider Models

- Analyze and identify models that address the problem
- Consider assumptions and context of the models
- Recognize possible limitations

### Communicate & Propose Action

- Craft a data story to convey insight to stakeholder audiences
- Justify claims with evidence from data and propose possible action
- Address uncertainty, constraints, and potential bias in the analysis

Lee, H. S., Wilkerson, M. H., & Zuckerman, S. J. (2022). Investigating data like a data scientist: A framework for elementary, middle, and high school teachers. *Statistics Education Research Journal, 21*(2). [https://doi.org/10.52041/serj.v21i2.41](https://doi.org/10.52041/serj.v21i2.41)

# Data Moves

Data moves are strategic actions taken during data analysis to reshape and prepare datasets for interpretation. These include filtering, grouping, summarizing, calculating, merging or joining data, and creating hierarchical structures. Each move alters the structure, content, or values of a dataset, influencing what patterns become visible and what questions can be explored. By understanding these moves, learners gain insight into how data analysis is an active, decision-driven process rather than a passive application of procedures.

Erickson, T., Tinker, R., & Yasuda, M. (2019). *Data moves*. UC Berkeley: The Concord Consortium. eScholarship, University of California. https://escholarship.org/uc/item/0mg8m7g6

## Calculating

Calculating is the process of creating a new attribute, often represented by a new column in a data table. This typically involves computing the values of the new attribute with a formula.

- Many new attributes are calculated using the values from one or more existing attributes. 
- Summary measures function as new, conceptual attributes as well; the difference is that they appear on group labels rather than individual data cases.

In addition to conceptual attributes, calculating can also be used to create
convenience attributes. For example, one may wish to create a categorical
attribute whose value is “tall” if an individual’s height is greater than the
mean height for their age, and “short” otherwise. Convenience attributes are quite common. Other examples include:

- Creating a new column that converts heights to inches instead of centimeters.
- Using birth dates included in a dataset to compute subjects’ ages.
- Recoding an education attribute from several categories (e.g., "GED," "high-school graduate," "one year of college," "bachelor’s degree," etc.) to fewer (perhaps, "completed high school," "completed college").
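
The convenience attributes described above could be sketched in pandas as follows. The `people` table, its column names, and its values are made up for illustration, and the recoding scheme is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names and values are invented for this sketch.
people = pd.DataFrame({
    "age": [30, 30, 40, 40],
    "height_cm": [165.0, 175.0, 160.0, 180.0],
    "education": ["GED", "bachelor's degree",
                  "high-school graduate", "one year of college"],
})

# Mean height within each age group, aligned back to every row.
mean_by_age = people.groupby("age")["height_cm"].transform("mean")

# Categorical attribute: "tall" if above the mean height for that age.
people["stature"] = np.where(people["height_cm"] > mean_by_age, "tall", "short")

# Unit conversion: centimeters to inches (1 inch = 2.54 cm).
people["height_in"] = people["height_cm"] / 2.54

# Recoding several education categories into fewer categories.
recode = {
    "GED": "completed high school",
    "high-school graduate": "completed high school",
    "one year of college": "completed high school",
    "bachelor's degree": "completed college",
}
people["education_grp"] = people["education"].map(recode)

print(people)
```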

# Data Dive

A data dive is a focused exploratory analysis where students work closely with a dataset to uncover patterns, trends, and relationships by applying key data moves. For example, using a dataset about school lunch nutrition, students might begin by filtering to isolate meals served in a specific year or location. They could group the data by food category such as fruits, grains, or proteins to explore how nutritional content differs across types. Through summarizing, they might calculate the average number of calories or the typical sodium content for each group. Calculating might involve creating new variables, such as calories per gram or the percentage of a recommended daily intake. If additional data sources are provided, students could join datasets, such as connecting lunch menus with student demographic information, to add context and depth to their findings. These data moves help students make sense of multivariable datasets and support evidence-based insights.
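
To make the school lunch scenario concrete, here is a rough sketch of the data moves it mentions. Both tables (`lunches` and `schools`), all column names, and all numbers are hypothetical and exist only to illustrate the moves.

```python
import pandas as pd

# Hypothetical lunch items; every value here is invented.
lunches = pd.DataFrame({
    "school": ["North", "North", "North", "South", "South", "South"],
    "year": [2023, 2023, 2023, 2024, 2024, 2024],
    "category": ["fruit", "grain", "protein", "fruit", "grain", "protein"],
    "calories": [100, 150, 220, 110, 200, 120],
    "grams": [200, 50, 110, 100, 160, 120],
})

# Hypothetical context table used for the join.
schools = pd.DataFrame({
    "school": ["North", "South"],
    "enrollment": [450, 620],
})

# Filtering: keep only the items served in 2024.
lunches_2024 = lunches[lunches["year"] == 2024]

# Grouping and summarizing: mean calories for each food category.
mean_calories = lunches.groupby("category")["calories"].mean()

# Calculating: a new attribute, calories per gram.
lunches["cal_per_gram"] = lunches["calories"] / lunches["grams"]

# Joining: attach school-level context to each lunch item.
joined = lunches.merge(schools, on="school", how="left")

print(mean_calories)
print(joined.head())
```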

## Analysis with Data Moves in Python

In this section, we present example use cases that demonstrate data moves using the Python programming language. While Python includes built-in tools and data structures for general data handling, it does not include a built-in data structure specifically designed for working with tidy data as defined by the _Tidy Data_ paper. To support the tidy data format and organize the analysis around data moves such as filtering, grouping, and summarizing, we will use the pandas library for data wrangling, NumPy for scientific computing, and the Matplotlib library for creating visualizations.

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Updated Starwars Dataset

Fabricio Narcizo’s blog post, Introduction to Data Analysis using the Star Wars Dataset, presents an expanded version of the original R dplyr Star Wars dataset, growing it from 14 to 25 variables. By cross-referencing Wookieepedia, he corrected and enriched the character data with new fields like birth/death details, pronouns, occupations, and abilities, resulting in a more accurate and comprehensive dataset for analysis.

#### Updated Starwars Datasheet

[Updated Starwars Datasheet](https://docs.google.com/document/d/1Gr6W0xo1pxW-TZH2GKawIASqZ7JfURmzQO1Eg2JGsC8/edit?usp=sharing)

Narcizo, F. B. (2023, December 30). Introduction to Data Analysis using the Star Wars Dataset. Retrieved from https://www.fabricionarcizo.com/post/starwars/

**Example 1.** Load the dataset into a pandas `DataFrame` called `starwars`.


In [None]:
starwars = pd.read_csv("data/updated_starwars.csv")

### Calculating

**Example 2.** Convert all the heights in the `starwars` dataframe from centimeters to meters.

In [None]:
# Converts the 'height' column from centimeters to meters
# Divides each value in the 'height' column by 100
# This returns a Series of height in meters for each character
starwars["height"] / 100

**Example 3.** Create a new column in the `starwars` `DataFrame` called `height_m` that stores each character's height in meters. Then, display all column names to confirm that the new column was added.

In [None]:
# Creates a new column called 'height_m' by converting height from centimeters to meters
starwars["height_m"] = starwars["height"] / 100

# Displays the names of all columns to confirm the new column was added
starwars.columns

**Example 4.** Calculate the Body Mass Index (BMI) for each character.

In [None]:
# Calculates BMI using the formula: mass (kg) divided by height (m) squared
# 'mass' in kilograms
# 'height_m' in meters
# This returns a Series of BMI values for each character
starwars["mass"] / (starwars["height_m"]**2)

**Example 5.** Create a new column called `bmi` in the `starwars` `DataFrame` by calculating the Body Mass Index (BMI) for each character. Then display the column names to confirm the new column was added.

In [None]:
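# Creates a new column called 'bmi': mass (kg) divided by height (m) squared
starwars["bmi"] = starwars["mass"] / (starwars["height_m"]**2)
# Displays the names of all columns to confirm the new column was added
starwars.columns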
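
To check the BMI arithmetic by hand, it can help to run the same calculation on a tiny stand-in table. The sketch below does not use the real dataset: the `sample` table and its values are assumptions chosen for illustration, including one missing mass to show how missing values carry through the calculation.

```python
import numpy as np
import pandas as pd

# Stand-in table with assumed values; not the real starwars data.
sample = pd.DataFrame({
    "name": ["Character A", "Character B"],
    "height": [172.0, 200.0],  # centimeters
    "mass": [77.0, np.nan],    # kilograms; NaN marks a missing value
})

# Same two calculating steps as in the examples above.
sample["height_m"] = sample["height"] / 100
sample["bmi"] = sample["mass"] / (sample["height_m"] ** 2)

# Character A: 77 / 1.72**2 = 77 / 2.9584, which is about 26.03
# Character B: the mass is missing, so the BMI is missing as well
print(sample[["name", "bmi"]].round(2))
```

Because arithmetic on a missing value produces a missing value, characters without a recorded mass or height will not have a BMI.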

**Example 6.** Create a list that classifies each character as either `"Human"`, `"Droid"`, or `"Other"` based on the value in the `species` column.

In [None]:
# Create an empty list to store the new species labels
species = []

# Loop through each value in the 'species' column
for label in starwars["species"]:
    
    # If the species is Human or Droid, keep the original label
    if label in ["Human", "Droid"]:
        species.append(label)
    
    # If the species is something else, label it as "Other"
    else:
        species.append("Other")

# Print the final list of species labels
print(species)

**Example 7.** Use the `.isin()` method to identify which rows of the `starwars` `DataFrame` correspond to characters whose species is either Human or Droid.

In [None]:
# Checks if the value in the 'species' column is either "Human" or "Droid"
# .isin() returns True if the species is in the list, otherwise False
# This returns a Series of True/False values
starwars["species"].isin(["Human", "Droid"])
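
A True/False Series like this one is often called a boolean mask, and it can be used to keep only the matching rows. The sketch below shows the idea on a small hypothetical table (`toy`) so that it runs without the dataset file; the names and species values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the starwars DataFrame.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "species": ["Human", "Droid", "Wookiee", None],
})

# Boolean mask: True where the species is Human or Droid.
mask = toy["species"].isin(["Human", "Droid"])

# Indexing with the mask keeps only the rows where it is True.
humans_and_droids = toy[mask]

# The ~ operator flips the mask, keeping every other row,
# including rows where the species is missing.
others = toy[~mask]

print(humans_and_droids)
print(others)
```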

**Example 8.** Use `np.where` to create an array that classifies each character as either `"Human"`, `"Droid"`, or `"Other"` based on the value in the species column. 

In [None]:
# Uses np.where to classify species as "Human", "Droid", or "Other"
# The first argument checks if each species is either "Human" or "Droid"
# The second argument keeps the original species if the condition is True
# The third argument replaces all other species with "Other"
# This returns a NumPy array containing the resulting labels
np.where(starwars["species"].isin(["Human", "Droid"]), starwars["species"], "Other")

`group_species` is a user-defined function that classifies a species value as either `"Human"`, `"Droid"`, or `"Other"`. It takes a single input, `s`, which must be a string or `None`. If the input is `"Human"` or `"Droid"`, the function returns that same value. If the input is any other species name or a missing value, it returns `"Other"`.

In [None]:
def group_species(s):
    """
    Classifies a species value as 'Human', 'Droid', or 'Other'.
    
    Examples:
    
    group_species("Human")  returns "Human"
    group_species("Rodian") returns "Other"
    group_species("Droid")  returns "Droid"

    Parameters:
    s: A single species value from the dataset.

    Returns:
    str: Returns 'Human' or 'Droid' if the input matches those values. 
         Otherwise returns 'Other'.
         
    Precondition: s is a string or None
    """
    if s in ["Human", "Droid"]:
        return s
    else:
        return "Other"

In [None]:
# .apply() runs the group_species function on each value in the 'species' column
# It goes through the column one row at a time
# This returns a Series of values of 'Human', 'Droid', or 'Other'
starwars["species"].apply(group_species)

You can use any of the previously demonstrated programming techniques to add a column to the `starwars` `DataFrame`. 

```python
starwars["species_grps"] = species
starwars["species_grps"] = starwars["species"].apply(group_species)
starwars["species_grps"] = np.where(starwars["species"].isin(["Human", "Droid"]), starwars["species"], "Other")
```
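
One way to convince yourself that these three techniques are interchangeable is to run them on the same data and compare the results. The sketch below does this on a small hypothetical table (`toy`) and restates `group_species` in compact form so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the starwars DataFrame, with one missing species.
toy = pd.DataFrame({"species": ["Human", "Droid", "Wookiee", None]})

def group_species(s):
    """Compact restatement of the function defined earlier."""
    return s if s in ["Human", "Droid"] else "Other"

# Technique 1: explicit loop that builds a list of labels.
by_loop = []
for label in toy["species"]:
    if label in ["Human", "Droid"]:
        by_loop.append(label)
    else:
        by_loop.append("Other")

# Technique 2: .apply() with the user-defined function.
by_apply = toy["species"].apply(group_species)

# Technique 3: np.where with an .isin() condition.
by_where = np.where(toy["species"].isin(["Human", "Droid"]),
                    toy["species"], "Other")

# Compare the three results as plain Python lists.
all_match = by_loop == list(by_apply) == list(by_where)
print(all_match)
```

The loop is the most explicit, `.apply()` reuses a documented function, and `np.where` is the most compact; which one to teach first is a matter of preference.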

**Example 24.** Choose a technique to add a new column to the `starwars` `DataFrame` that classifies each character as `"Human"`, `"Droid"`, or `"Other"` based on their species.

In [None]:
starwars["species_grps"] = ...
starwars.columns