# Introduction to Exploratory Data Analysis (EDA)

**Student Notebook**  

Welcome! This notebook will guide you through exploring *any* .csv dataset step by step. Don’t worry about making mistakes -- the whole point of EDA is to explore, and you're just getting started.

**Instructions:** In your small group, work through this Jupyter notebook together. Run each code cell from top to bottom, and read each text cell carefully before running the code below it. Remember to ASK QUESTIONS!


## 0. Quick reminders

As we learned last class, using the "#" symbol leaves a comment. Please get in the habit of leaving comments throughout your code! 

Why do we use comments? Write your answer to this question next to the print statement below:

In [1]:
print("Leave a comment next to this code -->") # [Why do we use comments?]

Leave a comment next to this code -->


## 1. Import Required Libraries

These libraries are commonly used for data analysis in Python.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for prettier plotting (used in Section 10)



## 2. Load the Dataset

Replace `'your_dataset.csv'` with the name of the .csv file that you would like to analyze.

NOTE: If this cell gives an error, double-check that:
- The file name is spelled correctly
- The file is in the same folder as this notebook
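
If you would like to run these two checks in code before loading, here is a minimal sketch. The helper name `check_csv_path` is made up for this example, and `'your_dataset.csv'` is the same placeholder as above. It also assumes the notebook's working folder is the folder the notebook lives in, which is usually (but not always) true.

```python
from pathlib import Path

def check_csv_path(filename):
    """Return a short message saying whether `filename` looks loadable."""
    path = Path(filename)
    if not path.exists():
        # resolve() shows the full path Python actually looked at
        return f"Not found: {path.resolve()}"
    if path.suffix.lower() != ".csv":
        return f"Found, but the extension is '{path.suffix}', not '.csv'"
    return "File found and it looks like a .csv"

print("Python is looking in:", Path.cwd())
print(check_csv_path("your_dataset.csv"))  # placeholder name, replace with yours
```

If the message says "Not found", compare the printed folder with where you saved the file.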

In [None]:
try:
    data = pd.read_csv('your_dataset.csv')  # replace with your file name
    print('Dataset loaded successfully!')
except Exception as e:
    print('Error loading dataset:')
    print(e)

Dataset loaded successfully!


## 3. First Look at the Data

**Note:** Fill in each `[...]` in the code comments with details from your own dataset.

### Preview the first 5 rows

This helps you understand what each column represents.

In [7]:
data.head()


### Dataset Size

Rows = observations (people, days, items)  
Columns = variables (features)
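
To see how rows and columns show up in `.shape`, here is a small sketch on a made-up table. The `toy` DataFrame and its values are invented for illustration and are not part of your dataset.

```python
import pandas as pd

# A tiny made-up table: 3 people (rows) described by 2 variables (columns)
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho"],
    "age": [14, 15, 14],
})

# .shape is a (rows, columns) pair, so we can unpack it into two names
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} variables")
```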

In [None]:
data.shape # There are [...] observations, and [...] variables in our dataset. 

### Column Names

In [None]:
list(data.columns) # The variables tell us [...] about our observations, which represent [...]. 

**STOP & THINK**  
- Which columns seem numeric (stored as numbers, so we can do calculations on them)?
- Which seem categorical (e.g., groups or classifications)?

## 4. Data Types

In [None]:
data.dtypes

Why this matters:
- Numeric data can be averaged, summed, and compared
- Categorical data is counted by group
- Dates allow time-based analysis
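
The three points above can be sketched on a made-up table. The `toy` DataFrame and its column names are assumptions for illustration only. Note that the dates start out as plain text and only become real dates after `pd.to_datetime`.

```python
import pandas as pd

toy = pd.DataFrame({
    "score": [80, 90, 100],                              # numeric
    "team": ["red", "blue", "red"],                      # categorical
    "day": ["2024-01-01", "2024-01-02", "2024-01-03"],   # dates stored as text
})

print(toy["score"].mean())         # numeric data can be averaged
print(toy["team"].value_counts())  # categorical data is counted

# Convert the text to real dates, then ask a time-based question
toy["day"] = pd.to_datetime(toy["day"])
print(toy["day"].dt.day_name())    # which weekday was each date?
```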

## 5. Summary Information

In [None]:
data.info() # What does the `info` function tell us about our data? 

### Summary Statistics (Numeric Columns)

In [None]:
data.describe() # What does the `describe` function tell us? 

## 6. Missing Data

Missing values are very common in real datasets.

In [None]:
data.isna().sum() # what does this seem to tell you? 

**STOP & THINK**  
- Which columns have missing values?
- Are they expected or surprising?
- Why might missing values matter? If some exist, should we simply drop every row that has one?
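
To make the last question concrete, here is a rough sketch on made-up data comparing two common options: dropping incomplete rows and filling gaps with the median. The `toy` table is invented, and neither option is "the right answer" in general. Which one makes sense depends on why the values are missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "height_cm": [150.0, np.nan, 160.0, 170.0],
    "city": ["Lima", "Oslo", None, "Oslo"],
})

print(toy.isna().sum())  # number of missing values in each column

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()
print(len(toy), "rows before,", len(dropped), "rows after dropping")

# Option 2: keep all rows and fill the numeric gap with the median
filled = toy["height_cm"].fillna(toy["height_cm"].median())
print(filled.tolist())
```

Here only two values are missing, but dropping removes half of the rows. That is one reason to look before you delete.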

## 7. Automatically Detect Column Types

Run this cell even if you skip the discussion: Sections 8-11 use `numeric_cols` and `categorical_cols`.

In [None]:
numeric_cols = data.select_dtypes(include='number').columns
categorical_cols = data.select_dtypes(exclude='number').columns

numeric_cols, categorical_cols
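
One thing to watch for: `select_dtypes` goes by how a column is *stored*, not by what it looks like. The sketch below uses a made-up `toy` table in which prices were saved as text. Converting with `pd.to_numeric` is one possible fix, shown here as an illustration rather than a required step.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": ["3.50", "4.00", "n/a"],  # numbers stored as text
    "qty": [1, 2, 3],
})

before = list(toy.select_dtypes(include="number").columns)
print(before)  # "price" is not detected as numeric yet

# errors="coerce" turns anything that is not a number (like "n/a") into NaN
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")

after = list(toy.select_dtypes(include="number").columns)
print(after)
```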

## 8. Exploring a Numeric Variable

Choose **one numeric column** from the list above. The cell below starts with the first one; change `col` to the name of the column you picked.

In [None]:
col = numeric_cols[0]  # change this if you want

data[col].hist(bins=20)
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title(f'Distribution of {col}')
plt.show()

Questions to ask:
- Is the distribution skewed (a long tail stretching to one side) or roughly symmetric around the center? What does that mean in the context of your data?
- Are there extreme values (outliers)?
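
Both questions can also be checked with numbers. The sketch below uses made-up values and the 1.5 × IQR rule, which is one common convention for flagging outliers and not the only definition.

```python
import pandas as pd

# Made-up values: mostly small numbers plus one very large one
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="toy_values")

# A mean far above the median hints at a long tail on the right
print("mean:", values.mean(), "median:", values.median())
print("skew:", values.skew())  # positive means the tail is on the right

# 1.5 * IQR rule: flag values far outside the middle 50% of the data
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers.tolist())
```

A flagged value is not automatically a mistake. It is a prompt to go look at that row.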

## 9. Exploring a Categorical Variable

In [None]:
cat_col = categorical_cols[0]  # change this if you want

data[cat_col].value_counts()

In [None]:
data[cat_col].value_counts().plot(kind='bar')
plt.ylabel('Count')
plt.title(f'Counts of {cat_col}')
plt.show()

Any takeaways from the plot you made above?
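
Raw counts can be hard to compare across datasets of different sizes, so it sometimes helps to look at shares as well. This is a small sketch on made-up data; `normalize=True` is a real `value_counts` option, while the `pets` series is invented.

```python
import pandas as pd

pets = pd.Series(["dog", "cat", "dog", "fish", "dog", "cat"], name="pet")

counts = pets.value_counts()                # how many of each category
shares = pets.value_counts(normalize=True)  # fraction of all rows

print(counts)
print(shares.round(2))
```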

## 10. Relationships Between Variables

### Numeric vs Numeric

In [None]:
if len(numeric_cols) >= 2:
    sns.scatterplot(data=data, x=numeric_cols[0], y=numeric_cols[1])
    plt.title('Relationship Between Two Numeric Variables')
    plt.show()
else:
    print('Not enough numeric columns for this plot.')
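
A scatterplot shows the shape of a relationship; a correlation coefficient puts a number on how close it is to a straight line. The sketch below uses made-up study data. Keep in mind that a high correlation does not tell you that one variable causes the other.

```python
import pandas as pd

toy = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "score": [52, 60, 71, 75, 88],
})

# Pearson correlation: near +1 means both rise together,
# near -1 means one falls as the other rises, near 0 means no linear pattern
r = toy["hours_studied"].corr(toy["score"])
print(round(r, 2))
```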

### Categorical vs Numeric

In [None]:
if len(categorical_cols) > 0 and len(numeric_cols) > 0:
    sns.boxplot(data=data, x=categorical_cols[0], y=numeric_cols[0])
    plt.title('Numeric Variable by Category')
    plt.show()
else:
    print('Not enough columns for this plot.')

## 11. Grouped Analysis

In [None]:
if len(categorical_cols) > 0 and len(numeric_cols) > 0:
    # print() is needed here: a result inside an if-block is not displayed automatically
    print(data.groupby(categorical_cols[0])[numeric_cols[0]].mean())
else:
    print('Not enough columns for grouped analysis.')

Discuss:  
- Which group has the highest average?
- Does this match your expectations?
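
To help with the first question, you can sort the group averages and also check how many rows each group has. The sketch below uses a made-up `toy` table; the column names are placeholders for your own.

```python
import pandas as pd

toy = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "green"],
    "score": [80, 70, 90, 60, 95],
})

# Average score and number of rows per team, highest average first
summary = toy.groupby("team")["score"].agg(["mean", "count"])
summary = summary.sort_values("mean", ascending=False)
print(summary)

# Name of the group with the highest average
print(summary["mean"].idxmax())
```

In this toy data the top group has only one row, so its "average" is a single score. Checking `count` next to `mean` helps you avoid reading too much into small groups.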

## 12. Mini Challenge -- YOUR TURN!

1. Look up **one new pandas function** not used in this notebook
2. Try it on your dataset
3. Write one sentence explaining what it does

Helpful places:
- pandas documentation
- Stack Overflow
- Yunge, Jackson, mentors
- ChatGPT (if all else fails...)

In [None]:
# test out whatever you want here! this is the fun part. do whatever your heart pleases. 

FILL THIS IN: 

The function I tried is called: `[...]`. It does [...]. I used it to learn [...] about my dataset. 

## Congrats -- you’ve completed your very own beginner-level EDA! 

**Key takeaways:**
- Always start by understanding structure and data types
- Missing values are normal
- Visuals reveal patterns tables cannot

**What do you remain curious about with your dataset? With Python?**