# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is one part of the Data Investigation Process that can include data cleaning, wrangling, and visualization. The "Data Moves" framework:

- Provides a structured set of categories (i.e., data moves) to describe and analyze how students engage with data

- Supports instructional design and assessment by offering a lens through which educators can identify, understand, and demonstrate data practices

Before exploring data, it is important to select datasets that are appropriate for students based on grade level, subject-area relevance, size, and number of features (i.e., variables). This notebook covers basic considerations for selecting datasets suitable for exploratory data analysis in an introductory data science course. It also provides examples of the core data moves, along with explanations of the output generated by each move (e.g., a value, table, or visualization).

## Selecting a Dataset

### Tidy Data

The 2014 paper *Tidy Data* presents a structured framework for organizing datasets to support efficient analysis. In a tidy dataset, each variable forms a column, each observation forms a row, and each type of observational unit is stored in a separate table. It also outlines strategies for transforming messy data into tidy form, demonstrating how this approach simplifies and strengthens data analysis practices.

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10
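As a minimal sketch of what moving from messy to tidy form can look like in pandas, the example below reshapes a small table whose column headers are really values of a variable. The `messy` table, its school names, and its enrollment numbers are made up for illustration and do not come from the paper.

```python
import pandas as pd

# Hypothetical "messy" table: the variable "year" is spread across column headers.
messy = pd.DataFrame({
    "school": ["North", "South"],
    "2022": [410, 385],
    "2023": [432, 390],
})

# melt() gathers the year columns into two variables, "year" and "enrollment",
# so that each row is one observation (one school in one year).
tidy = messy.melt(id_vars="school", var_name="year", value_name="enrollment")

# The year labels started out as column names (strings), so convert them to integers.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

In the tidy version each variable (`school`, `year`, `enrollment`) has its own column, which is the layout that grouping and summarizing operations in pandas expect.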

### Tame Data

The 2018 paper *The fivethirtyeight R Package* introduces the concept of tame data, which refers to datasets that are clean, well-labeled, and easy to use in teaching. Tame data minimizes the need for wrangling so students can focus on analysis. The paper highlights the importance of using structured and accessible data in introductory statistics and data science courses.

Kim, A. Y., Ismay, C., & Chunn, J. (2018). The fivethirtyeight R Package: “Tame Data” Principles for Introductory Statistics and Data Science Courses. Technology Innovations in Statistics Education, 11(1). https://doi.org/10.5070/T5111035892

## Investigating Data Like a Data Scientist

Investigating data like a data scientist involves an iterative process of making sense of information. This process includes six key phases: 

- Framing the problem
- Exploring and visualizing data
- Modeling
- Evaluating results
- Crafting a narrative
- Communicating findings

These phases reflect authentic data science practice and provide a structure that supports more meaningful engagement with data in educational settings.

Rather than following a fixed procedure, this framework emphasizes the importance of habits of mind such as critical thinking, refining questions, and considering the audience. It highlights that data science relies not only on technical skills but also on decision-making, interpretation, and storytelling. When students are guided through these phases, they develop the analytical reasoning needed to navigate data and communicate insights.

## Data Investigation Process Framework


### Frame Problem

- Consider real-world phenomena and broader issues related to the problem
- Pose investigative question(s)
- Anticipate potential data and strategies

### Consider & Gather Data

- Understand possible attributes, measurements, and data collection methods needed for the problem
- Evaluate and use appropriate design and techniques to collect or source data
- Consider sample size, access, storage, and trustworthiness of data

### Process Data

- Organize, structure, clean, and transform data in efficient and useful ways
- Consider additional data cases or attributes

### Explore & Visualize Data

- Construct meaningful visualizations, static or dynamic
- Compute meaningful statistical measures
- Explore and analyze data for potential relationships or patterns that address the problem

### Consider Models

- Analyze and identify models that address the problem
- Consider assumptions and context of the models
- Recognize possible limitations

### Communicate & Propose Action

- Craft a data story to convey insight to stakeholder audiences
- Justify claims with evidence from data and propose possible action
- Address uncertainty, constraints, and potential bias in the analysis

Lee, H. S., Mojica, G. F., Thrasher, E. P., & Baumgartner, P. (2022). Investigating data like a data scientist: Key practices and processes. *Statistics Education Research Journal, 21*(2). [https://doi.org/10.52041/serj.v21i2.41](https://doi.org/10.52041/serj.v21i2.41)

# Data Moves

Data moves are strategic actions taken during data analysis to reshape and prepare datasets for interpretation. These include filtering, grouping, summarizing, calculating, merging or joining data, and creating hierarchical structures. Each move alters the structure, content, or values of a dataset, influencing what patterns become visible and what questions can be explored. By understanding these moves, learners gain insight into how data analysis is an active, decision-driven process rather than a passive application of procedures.

Erickson, T., Wilkerson, M., Finzer, W., & Reichsman, F. (2019). Data moves. *Technology Innovations in Statistics Education, 12*(1). https://escholarship.org/uc/item/0mg8m7g6

## Summarizing

Summarizing is the process of producing and recording a summary or aggregate value, i.e., a statistic. 

- The mean is a common summary function, but it is not the only option.
- Summary measures are not limited to numerical or “typical” values.
- Some summaries are non-numerical, e.g., identifying the most common category, such as the most frequently mentioned pet type ("dog") in a survey.
- The purpose of summarizing is not always to focus on the measure itself or to compare across groups; an aggregate value can also serve as data for further analysis.
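The points above can be sketched with a few pandas calls. The `survey` table below is hypothetical (its columns and values are invented for illustration); it shows a numerical summary, a non-numerical summary, and a set of group summaries that is then treated as data for a further calculation.

```python
import pandas as pd

# Hypothetical survey responses, made up for illustration.
survey = pd.DataFrame({
    "grade": [6, 6, 6, 7, 7, 7],
    "pet": ["dog", "cat", "dog", "dog", "fish", "cat"],
    "hours_sleep": [9.0, 8.0, 8.5, 7.5, 8.0, 7.0],
})

# Numerical summaries: the mean is one option, the median is another.
mean_sleep = survey["hours_sleep"].mean()
median_sleep = survey["hours_sleep"].median()

# A non-numerical summary: the most common category.
# mode() returns a Series because ties are possible, so take the first entry.
most_common_pet = survey["pet"].mode().iloc[0]

# Aggregate values can serve as data: compute one mean per grade,
# then analyze those means (here, the gap between the highest and lowest).
sleep_by_grade = survey.groupby("grade")["hours_sleep"].mean()
gap = sleep_by_grade.max() - sleep_by_grade.min()

print(mean_sleep, median_sleep, most_common_pet)
print(sleep_by_grade)
print(gap)
```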

# Data Dive

A data dive is a focused exploratory analysis where students work closely with a dataset to uncover patterns, trends, and relationships by applying key data moves. For example, using a dataset about school lunch nutrition, students might begin by filtering to isolate meals served in a specific year or location. They could group the data by food category such as fruits, grains, or proteins to explore how nutritional content differs across types. Through summarizing, they might calculate the average number of calories or the typical sodium content for each group. Calculating might involve creating new variables, such as calories per gram or the percentage of a recommended daily intake. If additional data sources are provided, students could join datasets, such as connecting lunch menus with student demographic information, to add context and depth to their findings. These data moves help students make sense of multivariable datasets and support evidence-based insights.
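A rough sketch of such a data dive in pandas might look like the following. Everything here is hypothetical: the `lunches` and `schools` tables, their column names, and their values are invented to illustrate the moves, and school enrollment stands in for the richer demographic information a real investigation might join in.

```python
import pandas as pd

# Hypothetical lunch menu data, made up for illustration.
lunches = pd.DataFrame({
    "item": ["apple", "banana", "roll", "rice", "chicken", "beans"],
    "category": ["fruit", "fruit", "grain", "grain", "protein", "protein"],
    "school": ["A", "B", "A", "B", "A", "B"],
    "year": [2023, 2023, 2023, 2022, 2023, 2023],
    "calories": [95, 105, 150, 200, 220, 120],
    "grams": [180, 120, 50, 160, 110, 130],
})

# Filtering: keep only meals served in one year.
lunches_2023 = lunches[lunches["year"] == 2023].copy()

# Calculating: create a new variable from existing ones.
lunches_2023["cal_per_gram"] = lunches_2023["calories"] / lunches_2023["grams"]

# Grouping and summarizing: average calories within each food category.
avg_calories = lunches_2023.groupby("category")["calories"].mean()

# Joining: attach context from a second (hypothetical) table about each school.
schools = pd.DataFrame({"school": ["A", "B"], "enrollment": [450, 620]})
joined = lunches_2023.merge(schools, on="school", how="left")

print(avg_calories)
print(joined)
```

Using `how="left"` keeps every lunch row even if a school were missing from the second table, which makes unmatched rows visible as missing values rather than silently dropping them.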

## Analysis with Data Moves in Python

In this section, we present example use cases that demonstrate data moves using the Python programming language. While Python includes built-in tools and data structures for general data handling, it does not include a built-in data structure specifically designed for working with tidy data as defined by the _Tidy Data_ paper. To support the tidy data format and organize the analysis around data moves such as filtering, grouping, and summarizing, we will use the pandas library for data wrangling, NumPy for scientific computing, and the Matplotlib library for creating visualizations.

### Importing Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

### Updated Starwars Dataset

Fabricio Narcizo’s blog post, Introduction to Data Analysis using the Star Wars Dataset, presents an expanded version of the original R dplyr Star Wars dataset, growing it from 14 to 25 variables. By cross-referencing Wookieepedia, he corrected and enriched the character data with new fields like birth/death details, pronouns, occupations, and abilities, resulting in a more accurate and comprehensive dataset for analysis.

#### Updated Starwars Datasheet

[Updated Starwars Datasheet](https://docs.google.com/document/d/1Gr6W0xo1pxW-TZH2GKawIASqZ7JfURmzQO1Eg2JGsC8/edit?usp=sharing)

Narcizo, F. B. (2023, December 30). Introduction to Data Analysis using the Star Wars Dataset. Retrieved from https://www.fabricionarcizo.com/post/starwars/

**Example 1.** Load the dataset into a pandas `DataFrame` called `starwars`.


In [None]:
starwars = pd.read_csv("data/updated_starwars.csv")

### Summarizing

**Example 2.** Use the `.info()` method to display a summary of the `starwars` DataFrame.

In [None]:
# .info() is a method that gives a quick summary of the DataFrame
# It shows the number of rows and columns
# It lists the column names and their data types
# It tells how many non-missing values are in each column
starwars.info()

**Output Observations:** 

**Example 3.** Select and display the `mass` column from the `starwars` DataFrame.

In [None]:
# Selects the 'mass' column from the starwars DataFrame
# Returns a Series of character masses
# The index on the left corresponds to the row number (i.e., each character)
# NaN means that the mass value is missing for that character
starwars["mass"]

**Output Observations:** 

**Example 4.** Use the `.describe()` method to generate summary statistics for the `mass` column in the `starwars` DataFrame.

In [None]:
# Calculates summary statistics for the 'mass' column
# count – number of non-missing values
# mean – average mass
# std – standard deviation (shows spread)
# min – smallest mass
# 25%, 50%, 75% – quartiles (values that split the data into four parts)
# max – largest mass
starwars["mass"].describe()

**Output Observations:** 
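To see where the numbers reported by `.describe()` come from, each one can also be computed on its own. The sketch below uses a small made-up `mass` Series with one missing value rather than the actual dataset, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical masses with one missing value, made up for illustration.
mass = pd.Series([77.0, 75.0, 32.0, 136.0, np.nan, 49.0], name="mass")

# The full summary in one call.
summary = mass.describe()

# The same pieces computed individually; each skips NaN by default.
count = mass.count()        # number of non-missing values
mean = mass.mean()          # average of the non-missing values
median = mass.median()      # same as the "50%" row
q1 = mass.quantile(0.25)    # same as the "25%" row
q3 = mass.quantile(0.75)    # same as the "75%" row
iqr = q3 - q1               # interquartile range, a measure of spread

print(summary)
print(count, mean, median, q1, q3, iqr)
```

Note that `count` is 5 rather than 6 here: missing values are excluded from every statistic, which is worth checking before interpreting a mean.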

**Example 5.** Create a histogram of the `mass` column to visualize the distribution of character masses.

In [None]:
# starwars["mass"] uses bracket notation to select a single column from the DataFrame
# The result is a pandas Series containing just the 'mass' values
# .hist() creates a histogram to show the frequency of different mass ranges
starwars["mass"].hist()

# Turns off the background grid in the plot for a cleaner visual
plt.grid(False);

**Output Observations:** 

**Example 6.** Count how many times each species appears in the `species` column of the `starwars` DataFrame.

In [None]:
# starwars["species"] uses bracket notation to select the 'species' column from the DataFrame
# This returns a pandas Series containing all species values
# .value_counts() counts how many times each unique value appears
# The result is sorted by count, from most to least common
starwars["species"].value_counts()

**Output Observations:** 

**Example 7.** Create a horizontal bar chart to visualize the species counts.

In [None]:
# starwars["species"] uses bracket notation to select the 'species' column
# .value_counts() counts how many times each species appears
# The result is saved to a variable called tbl
tbl = starwars["species"].value_counts()

# Creates a horizontal bar chart of the species counts
# kind='barh' specifies a horizontal bar plot
tbl.plot(kind = 'barh');

**Output Observations:** 

**Example 8.** Create a horizontal bar chart of the species counts that keeps the two most common species and combines all remaining species into a single "Other" category.

In [None]:
# Count how many times each species appears
tbl = starwars["species"].value_counts()

# Combine all species from position 2 onward (everything except the top 2)
# into a new Series labeled "Other"
# .iloc selects by position, matching the .iloc[0:2] used for the top 2 below
other = pd.Series([tbl.iloc[2:].sum()], index=["Other"])

# Concatenate the top 2 species with the "Other" group
# Then sort the result so the bars appear in order from 
# smallest to largest
pd.concat([tbl.iloc[0:2], other]).sort_values(ascending=True).plot(kind = 'barh');

**Output Observations:**  
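The grouping used in Example 8 can be wrapped in a small helper so the number of categories kept is easy to change. This is only a sketch: `top_n_with_other` is a hypothetical function name, not part of pandas, and the `species` values below are made up rather than read from the dataset.

```python
import pandas as pd

def top_n_with_other(counts: pd.Series, n: int = 2, label: str = "Other") -> pd.Series:
    """Keep the n largest counts and combine the rest under a single label."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.iloc[:n]      # the n most common categories (by position)
    rest = ordered.iloc[n:]     # everything else
    if rest.empty:
        return top              # nothing to combine
    other = pd.Series([rest.sum()], index=[label])
    return pd.concat([top, other])

# Hypothetical species values, made up for illustration.
species = pd.Series(["Human", "Human", "Human", "Droid", "Droid", "Wookiee", "Gungan"])
grouped = top_n_with_other(species.value_counts(), n=2)

print(grouped)
```

The result can be plotted the same way as before, for example with `grouped.sort_values().plot(kind="barh")`.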

**Example 9.** Count how many times each eye color appears in the `eye_color` column of the `starwars` DataFrame.

In [None]:
# Selects the 'eye_color' column using bracket notation
# Returns a Series of eye colors for all characters
# .value_counts() counts how many times each unique eye color appears
# The result is sorted from most to least frequent
starwars["eye_color"].value_counts()

**Output Observations:** 

**Example 10.** Count how many times each hair color appears in the `hair_color` column of the `starwars` DataFrame.

In [None]:
# Selects the 'hair_color' column from the DataFrame using bracket notation
# This returns a Series of all hair color values
# .value_counts() counts how often each unique value appears
# The result is sorted from most to least frequent
starwars["hair_color"].value_counts()

In [None]:
# Counts how many times each unique hair color appears
# dropna=False includes missing values (NaN) in the count
# Without this, missing values would be excluded by default
starwars["hair_color"].value_counts(dropna=False)

**Output Observations:** 
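The effect of `dropna` is easiest to see on a small example. The `hair` Series below is made up for illustration; in particular, treating the text label `"none"` as a category distinct from a missing value is an assumption about how such data might be coded, not a claim about the actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical hair colors: "none" is a recorded category, NaN is a missing value.
hair = pd.Series(
    ["brown", "none", np.nan, "brown", "black", np.nan, "brown"],
    name="hair_color",
)

# Default behavior: missing values are left out of the counts.
counts = hair.value_counts()

# dropna=False adds a row for the missing values.
counts_with_missing = hair.value_counts(dropna=False)

# Count the missing values directly as a cross-check.
n_missing = hair.isna().sum()

# normalize=True reports proportions of the non-missing values instead of counts.
proportions = hair.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(n_missing)
print(proportions)
```

Comparing the totals of the two count tables shows how many rows were silently excluded by the default.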

**Example 11.** Create a cross-tabulation table to compare eye color and hair color. Use `pd.crosstab()` to count how many characters have each combination of eye color and hair color.

In [None]:
# Creates a cross-tabulation of eye color (rows) and hair color (columns)
# Counts how many characters fall into each combination of eye color and hair color
# Stores the result in a new table called tbl
tbl = pd.crosstab(starwars["eye_color"], starwars["hair_color"])

# Display the resulting table
tbl

**Output Observations:** 
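Two optional arguments of `pd.crosstab()` are often useful when reading a table like this: `margins=True` adds row and column totals, and `normalize="index"` converts each row to proportions. The sketch below uses a small hypothetical `people` table with made-up values instead of the actual dataset.

```python
import pandas as pd

# Hypothetical characters, made up for illustration.
people = pd.DataFrame({
    "eye_color": ["blue", "blue", "brown", "brown", "brown", "yellow"],
    "hair_color": ["blond", "brown", "brown", "black", "brown", "none"],
})

# Plain counts for each combination (combinations that never occur show 0).
tbl = pd.crosstab(people["eye_color"], people["hair_color"])

# margins=True adds an "All" row and column holding the totals.
tbl_totals = pd.crosstab(people["eye_color"], people["hair_color"], margins=True)

# normalize="index" divides each row by its total, so every row sums to 1.
row_props = pd.crosstab(people["eye_color"], people["hair_color"], normalize="index")

print(tbl)
print(tbl_totals)
print(row_props)
```

Row proportions answer questions such as "among brown-eyed characters, what share have brown hair?", which raw counts alone make harder to compare across groups of different sizes.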