# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is one part of the Data Investigation Process that can include data cleaning, wrangling, and visualization. The "Data Moves" framework:

- Provides a structured set of categories (i.e., data moves) to describe and analyze how students engage with data

- Supports instructional design and assessment by offering a lens through which educators can identify, understand, and demonstrate data practices

Before exploring data, it is important to select datasets that are appropriate for students based on grade level, subject-area relevance, size, and number of features (i.e., variables). This notebook covers basic considerations for selecting datasets suitable for exploratory data analysis in an introductory data science course. It also provides examples of the core data moves, along with explanations of the output generated by each move (e.g., a value, table, or visualization).

## Selecting a Dataset

### Tidy Data

The 2014 paper *Tidy Data* presents a structured framework for organizing datasets to support efficient analysis. In a tidy dataset, each variable forms a column, each observation forms a row, and each type of observational unit is stored in a separate table. It also outlines strategies for transforming messy data into tidy form, demonstrating how this approach simplifies and strengthens data analysis practices.

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10
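
As a rough illustration of the tidy form described above, the sketch below reshapes a small "messy" table into a tidy one with pandas. The table, its column names, and its values are invented for this example and do not come from the paper.

```python
import pandas as pd

# Hypothetical "messy" table: there is one column per year, so the
# variable "year" is stored in the column headers instead of a column.
messy = pd.DataFrame({
    "name": ["Ava", "Liam"],
    "2020": [12, 15],
    "2021": [9, 18],
})

# melt() reshapes wide data into long form: each row becomes one
# observation (a name in a given year) and each column one variable.
tidy = messy.melt(id_vars="name", var_name="year", value_name="count")

# The former column headers are strings, so convert them to integers
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

In the tidy version, `name`, `year`, and `count` are each a column, which is the layout that filtering and grouping operations expect.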

### Tame Data

The 2018 paper *The fivethirtyeight R Package* introduces the concept of tame data, which refers to datasets that are clean, well-labeled, and easy to use in teaching. Tame data minimizes the need for wrangling so students can focus on analysis. The paper highlights the importance of using structured and accessible data in introductory statistics and data science courses.

Kim, A. Y., Ismay, C., & Chunn, J. (2018). The fivethirtyeight R Package: “Tame Data” Principles for Introductory Statistics and Data Science Courses. Technology Innovations in Statistics Education, 11(1). https://doi.org/10.5070/T5111035892

## Investigating Data Like a Data Scientist

Investigating data like a data scientist involves an iterative process of making sense of information. This process includes six key phases: 

- Framing the problem
- Exploring and visualizing data
- Modeling
- Evaluating results
- Crafting a narrative
- Communicating findings

These phases reflect authentic data science practice and provide a structure that supports more meaningful engagement with data in educational settings.

Rather than following a fixed procedure, this framework emphasizes the importance of habits of mind such as critical thinking, refining questions, and considering the audience. It highlights that data science relies not only on technical skills but also on decision-making, interpretation, and storytelling. When students are guided through these phases, they develop the analytical reasoning needed to navigate data and communicate insights.

## Data Investigation Process Framework


### Frame Problem

- Consider real-world phenomena and broader issues related to the problem
- Pose investigative question(s)
- Anticipate potential data and strategies

### Consider & Gather Data

- Understand possible attributes, measurements, and data collection methods needed for the problem
- Evaluate and use appropriate design and techniques to collect or source data
- Consider sample size, access, storage, and trustworthiness of data

### Process Data

- Organize, structure, clean, and transform data in efficient and useful ways
- Consider additional data cases or attributes

### Explore & Visualize Data

- Construct meaningful visualizations, static or dynamic
- Compute meaningful statistical measures
- Explore and analyze data for potential relationships or patterns that address the problem

### Consider Models

- Analyze and identify models that address the problem
- Consider assumptions and context of the models
- Recognize possible limitations

### Communicate & Propose Action

- Craft a data story to convey insight to stakeholder audiences
- Justify claims with evidence from data and propose possible action
- Address uncertainty, constraints, and potential bias in the analysis

Lee, H. S., Wilkerson, M. H., & Zuckerman, S. J. (2022). Investigating data like a data scientist: A framework for elementary, middle, and high school teachers. *Statistics Education Research Journal, 21*(2). [https://doi.org/10.52041/serj.v21i2.41](https://doi.org/10.52041/serj.v21i2.41)

# Data Moves

Data moves are strategic actions taken during data analysis to reshape and prepare datasets for interpretation. These include filtering, grouping, summarizing, calculating, merging or joining data, and creating hierarchical structures. Each move alters the structure, content, or values of a dataset, influencing what patterns become visible and what questions can be explored. By understanding these moves, learners gain insight into how data analysis is an active, decision-driven process rather than a passive application of procedures.

Erickson, T., Wilkerson, M., Finzer, W., & Reichsman, F. (2019). Data moves. *Technology Innovations in Statistics Education, 12*(1). https://escholarship.org/uc/item/0mg8m7g6

## Merging

Merging combines multiple datasets into one. The simplest form of merging concatenates datasets about the same phenomenon but from different sources, for example, combining height data from two different classrooms to make a larger dataset. 

## Joining

Joining is a more complex form of merging. It does not add new cases, but rather adds more information (i.e., new attributes) about existing cases from a separate dataset. For example, in a school system, student demographic data might be stored in one table and test scores in another. Using a student ID as a key, the two tables can be joined to combine information for the same students.

# Data Dive

A data dive is a focused exploratory analysis where students work closely with a dataset to uncover patterns, trends, and relationships by applying key data moves. For example, using a dataset about school lunch nutrition, students might begin by filtering to isolate meals served in a specific year or location. They could group the data by food category such as fruits, grains, or proteins to explore how nutritional content differs across types. Through summarizing, they might calculate the average number of calories or the typical sodium content for each group. Calculating might involve creating new variables, such as calories per gram or the percentage of a recommended daily intake. If additional data sources are provided, students could join datasets, such as connecting lunch menus with student demographic information, to add context and depth to their findings. These data moves help students make sense of multivariable datasets and support evidence-based insights.
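
The school lunch scenario above might look roughly like the following in pandas. This is only a sketch: the `lunch` table, its column names, and every value in it are invented for illustration.

```python
import pandas as pd

# Hypothetical school lunch data; all values are invented
lunch = pd.DataFrame({
    "year": [2023, 2023, 2023, 2024, 2024, 2024],
    "category": ["fruit", "grain", "protein", "fruit", "grain", "protein"],
    "item": ["apple", "roll", "chicken", "orange", "rice", "beans"],
    "calories": [95, 150, 220, 60, 200, 180],
    "grams": [180, 50, 110, 130, 160, 120],
})

# Filtering: keep only the meals served in 2024
lunch_2024 = lunch[lunch["year"] == 2024]

# Calculating: create a new variable from existing ones
lunch = lunch.assign(cal_per_gram=lunch["calories"] / lunch["grams"])

# Grouping and summarizing: average calories for each food category
avg_calories = lunch.groupby("category")["calories"].mean()

print(lunch_2024)
print(avg_calories)
```

Each step returns a new table or summary, so students can inspect the result of one data move before applying the next.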

## Analysis with Data Moves in Python

In this section, we present example use cases that demonstrate data moves using the Python programming language. While Python includes built-in tools and data structures for general data handling, it does not include a built-in data structure specifically designed for working with tidy data as defined by the _Tidy Data_ paper. To support the tidy data format and organize the analysis around data moves such as filtering, grouping, and summarizing, we will use the pandas library for data wrangling, NumPy for scientific computing, and the Matplotlib library for creating visualizations.

### Importing Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

### Merging

### Popular Baby Names Dataset

This dataset contains state-specific data on the relative frequency of given names for individuals issued a Social Security Number in the United States. Data is tabulated from Social Security Administration records as of March 2, 2025. The files include annual birth name frequencies by sex and state, beginning in 1910, for all 50 states and the District of Columbia.

Each file lists names with at least 5 occurrences in a given year to protect individual privacy. Records are sorted by sex, year, and descending frequency, with alphabetical order breaking ties, which enables direct rank determination.

#### Popular Baby Names Datasheet

[Popular Baby Names Datasheet](https://docs.google.com/document/d/1uMFpRbvO1NhGVvfRSw3eDpp1O-6a7UeLE3GeZ8OfTuM/edit?usp=sharing)

Social Security Administration. (n.d.). Popular baby names: Data limits and exclusions. Retrieved March 2, 2025, from https://www.ssa.gov/oact/babynames/limits.html

The code below loads a subset of state files into separate dataframes. The table that follows shows how each file corresponds to its respective dataframe.

|State|File|DataFrame|
|:-----|:-----|:-----|
|Indiana|`IN.TXT`|`ind`|
|Michigan|`MI.TXT`|`mich`|
|Ohio|`OH.TXT`|`ohio`|
|Pennsylvania|`PA.TXT`|`penn`|
|North Carolina|`NC.TXT`|`nc`|

In [None]:
ind = pd.read_csv("data/IN.TXT")
mich = pd.read_csv("data/MI.TXT")
ohio = pd.read_csv("data/OH.TXT")
penn = pd.read_csv("data/PA.TXT")
nc = pd.read_csv("data/NC.TXT")

It’s good practice to inspect the structure and metadata of a `DataFrame` using the `.info()` method.

**Structure** refers to the overall layout of the `DataFrame`, including:
- Number of rows and columns
- Column names
- Data types (e.g., int64, object, float64)
- Index type and range

**Metadata** refers to information about the data rather than the data itself, including:
- Which columns contain missing values (non-null counts)
- Total memory usage
- Index type
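
Most of the items listed above can also be read one at a time from individual attributes and methods, which may help when interpreting the `.info()` report. The sketch below uses a small invented `DataFrame`, not the baby names data.

```python
import pandas as pd

# Small hypothetical DataFrame with one missing value
df = pd.DataFrame({
    "name": ["Ava", "Liam", None],
    "count": [12, 15, 9],
})

shape = df.shape                              # (number of rows, number of columns)
dtypes = df.dtypes                            # data type of each column
missing = df.isna().sum()                     # missing values per column
memory = df.memory_usage(deep=True).sum()     # total memory usage in bytes

print(shape)
print(dtypes)
print(df.index)   # index type and range
print(missing)
print(memory)
```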

**Example 1.** Run the cell below. What do you notice?

In [None]:
ohio.info()

**Output Observations:**

In [None]:
ohio.head()

In [None]:
# Reads the 'OH.TXT' file from the 'data' folder into a DataFrame named 'ohio'
# header=None tells pandas that the file does not have a header row, so it should 
# not treat the first row as column names
# Default column names will be assigned as integers: 0, 1, 2, 3, etc.
ohio = pd.read_csv("data/OH.TXT", header=None)

In [None]:
ohio.info()

**Example 2.** Rename the columns of the `ohio` `DataFrame` as `state`, `sex`, `year`, `name`, and `count`. Then, display all column names to confirm that the columns were renamed.

In [None]:
# Assigns custom column names to the 'ohio' DataFrame
# These names replace the default numeric column 
# labels (0, 1, 2, 3, 4)
ohio.columns = ['state', 'sex', 'year', 'name', 'count']
ohio.columns

In [None]:
ohio.info()
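
The two steps above (reading with `header=None`, then assigning column names) can also be done in a single call by passing the `names` parameter to `pd.read_csv()`. The sketch below reads from an in-memory string so that it runs without the data files; the two rows and their counts are invented and only mimic the five-column layout of the state files.

```python
import io
import pandas as pd

# Hypothetical rows in the same five-column layout as the state files
sample = io.StringIO("OH,F,1910,Mary,1234\nOH,F,1910,Helen,987\n")

columns = ["state", "sex", "year", "name", "count"]

# header=None: the file has no header row
# names=columns: use these labels instead of the default 0, 1, 2, ...
sample_df = pd.read_csv(sample, header=None, names=columns)

print(sample_df)
```

With a real file, the same call would take a path such as `"data/OH.TXT"` in place of the `sample` object.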

**Example 3.** Filter the `ohio` `DataFrame` for the name "Alexa" to see how its popularity has changed over time.

In [None]:
name = "Alexa"

# Creates a Boolean mask that is True for rows where the 'name' column matches the variable 'name'
mask = ohio["name"] == name

# Uses the mask to filter the DataFrame and return only the matching rows
# Then selects the 'year' and 'count' columns using double brackets [[...]] to return a DataFrame
ohio[mask][["year", "count"]]
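
Because the files record counts separately by sex, a name may appear in more than one row for the same year. One way to narrow the filter is to combine two conditions in a single mask. The sketch below uses a few invented rows in the same layout as `ohio`; the names `names_demo` and `demo_mask` are placeholders for this example.

```python
import pandas as pd

# Hypothetical rows in the same layout as the 'ohio' DataFrame
names_demo = pd.DataFrame({
    "state": ["OH"] * 4,
    "sex": ["F", "M", "F", "F"],
    "year": [2000, 2000, 2001, 2001],
    "name": ["Alexa", "Alexa", "Alexa", "Emma"],
    "count": [150, 5, 160, 300],
})

# Each comparison produces a Boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
demo_mask = (names_demo["name"] == "Alexa") & (names_demo["sex"] == "F")

# .loc selects rows with the mask and columns by label in one step
alexa_f = names_demo.loc[demo_mask, ["year", "count"]]

print(alexa_f)
```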

**Example 4.** Create a line chart to visualize how the popularity of the name "Alexa" has changed over time.

In [None]:
ohio[ohio["name"] == name].plot(x="year", y="count", kind="line");

**Output Observations:** 

**Example 5.** Merge the Michigan data with the Ohio data.

In [None]:
mich = pd.read_csv("data/MI.TXT", header=None)
mich.columns = ['state', 'sex', 'year', 'name', 'count']
mich.info()

In [None]:
# Combines the 'ohio' and 'mich' DataFrames into a single DataFrame using pd.concat()
# ignore_index=True resets the row index so it runs from 0 to n-1 in the combined DataFrame
pd.concat([ohio, mich], ignore_index=True)

In [None]:
# Stores the combined data from both Ohio and Michigan in a DataFrame
ohio_mich = pd.concat([ohio, mich], ignore_index=True)

# Displays the combined DataFrame with data from both Ohio and Michigan
# For a long DataFrame the display is truncated: by default the first 5 and the last 5 rows are shown
ohio_mich
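
`pd.concat()` accepts a list of any length, so the same move extends to more than two states. The sketch below uses tiny invented stand-ins for the state `DataFrame`s (the names `oh_demo`, `mi_demo`, and `pa_demo` are hypothetical) and checks that merging only adds cases.

```python
import pandas as pd

columns = ["state", "sex", "year", "name", "count"]

# Hypothetical stand-ins for the state DataFrames; values are invented
oh_demo = pd.DataFrame([["OH", "F", 2000, "Emma", 300]], columns=columns)
mi_demo = pd.DataFrame(
    [["MI", "F", 2000, "Emma", 250], ["MI", "M", 2000, "Noah", 90]],
    columns=columns,
)
pa_demo = pd.DataFrame([["PA", "M", 2000, "Noah", 120]], columns=columns)

frames = [oh_demo, mi_demo, pa_demo]
combined = pd.concat(frames, ignore_index=True)

# Merging adds cases, so the row count is the sum of the input row counts
expected_rows = sum(len(frame) for frame in frames)
print(len(combined) == expected_rows)

# Number of rows contributed by each state
state_counts = combined["state"].value_counts()
print(state_counts)
```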

### Joining

### CEO Compensation Summary Dataset

The data from the AFL-CIO Executive Paywatch database draws from company proxy statements that are filed with the U.S. Securities and Exchange Commission and collected by [pay-gap.com](https://aflcio.org/paywatch/pay-gap.com). The database includes data for some 3,000 corporations, including most of those listed in the Russell 3000 Index. Industry classifications are based on North American Industry Classification System codes.

#### CEO Compensation Summary Datasheet

[CEO Compensation Summary Datasheet](https://docs.google.com/document/d/1AJriZiqMarx8-r4WZwoXoCkFKLEDpq2jt0V6t5_tQLM/edit?usp=drive_link)

AFL-CIO. (n.d.). Highest-Paid CEOs. Retrieved 2022, from https://aflcio.org/paywatch/highest-paid-ceos

In [None]:
ceo = pd.read_csv("data/ceo_compensation_summary.csv")
ceo.info()

In [None]:
ceo.head()

### Company Information Dataset

The companies in this dataset come from the AFL-CIO Executive Paywatch database, which compiles data from company proxy statements filed with the U.S. Securities and Exchange Commission and collected by paygap.com. The dataset includes approximately 3,000 corporations, primarily those listed in the Russell 3000 Index, with industry classifications based on North American Industry Classification System (NAICS) codes. To supplement this dataset, additional company information including sector, industry, and market capitalization was collected using the Python yfinance library, which provides streamlined access to company data from Yahoo Finance.

#### Company Information Datasheet

[Company Information Datasheet](https://docs.google.com/document/d/1t_J1RKSpc8qXhozS8F1K82Ac-429zETQUJXT8PvRCm0/edit?usp=sharing)

Aroussi, R. (2019). yfinance: Yahoo! Finance market data downloader [Python library]. https://github.com/ranaroussi/yfinance

In [None]:
company = pd.read_csv("data/company_information.csv")
company.info()

In [None]:
company.head()

**Example 6.** Display the column names from both the `ceo` and the `company` `DataFrame`.

In [None]:
print("Columns in the ceo dataframe")
print(ceo.columns)
print("\n")
print("Columns in the company dataframe")
print(company.columns)

**Example 7.** Display the information in the first row of the `ceo` `DataFrame`.

In [None]:
# .loc is a label-based accessor used to retrieve rows (and optionally columns) by index label
# With the default integer index created by read_csv, the label 0 refers to the first row of 'ceo'
# It returns all column values for that row as a Series
ceo.loc[0]

**Example 8.** Display the information in the first row of the `company` `DataFrame`.

In [None]:
# .loc is a label-based accessor used to retrieve rows (and optionally columns) by index label
# With the default integer index created by read_csv, the label 0 refers to the first row of 'company'
# It returns all column values for that row as a Series
company.loc[0]
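
Since `.loc` looks up a label rather than a position, `.loc[0]` is the first row only while the index is still in its original order. The sketch below, built on a small invented table, shows how the two can differ after sorting and how the position-based `.iloc` behaves instead.

```python
import pandas as pd

# Hypothetical table; tickers and values are invented
pay_demo = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "pay": [10, 30, 20],
})

# Sorting reorders the rows but keeps their original index labels
by_pay = pay_demo.sort_values("pay", ascending=False)

row_by_label = by_pay.loc[0]      # the row whose index label is 0
row_by_position = by_pay.iloc[0]  # the row that is first after sorting

print(row_by_label["ticker"])
print(row_by_position["ticker"])
```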

**Example 9.** Use `pd.merge()` to combine the `ceo` and `company` dataframes based on the shared `ticker` column. 

In [None]:
# Merges the 'ceo' and 'company' DataFrames using the 'ticker' column as the key
# Only rows with matching 'ticker' values in both DataFrames will be included (inner join by default)
# Returns a new DataFrame that combines columns from both sources
pd.merge(ceo, company, on="ticker")
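
Because the default inner join silently drops rows without a match, it can be useful to compare it with a left join. The sketch below uses two small invented tables (`ceo_demo` and `company_demo` are hypothetical stand-ins, not the real datasets) and the `how` and `indicator` parameters of `pd.merge()`.

```python
import pandas as pd

# Hypothetical stand-ins for the 'ceo' and 'company' DataFrames
ceo_demo = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "ceo_pay": [10, 30, 20],
})
company_demo = pd.DataFrame({
    "ticker": ["AAA", "BBB", "DDD"],
    "sector": ["Tech", "Energy", "Retail"],
})

# Inner join (the default): keeps only tickers found in both tables
inner = pd.merge(ceo_demo, company_demo, on="ticker")

# Left join: keeps every row of ceo_demo; unmatched rows get missing values.
# indicator=True adds a '_merge' column that records where each row came from.
left = pd.merge(ceo_demo, company_demo, on="ticker", how="left", indicator=True)

print(inner)
print(left)
```

In the left join, the row for `CCC` is kept with a missing `sector`, and the `_merge` column marks it as `left_only`, which makes unmatched keys easy to spot.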