# 📂 Notebook 03: Loading and Exploring Data

Welcome to the real world — where data lives in messy CSVs and your job is to make sense of it.

In this notebook, you’ll:
- Load a CSV file using `pd.read_csv()`
- Use `.head()`, `.tail()`, `.info()`, and `.describe()` to explore your data
- Identify potential issues (missing values, bad types, oddball rows)

Let’s get our hands dirty.

---

In [2]:
import pandas as pd

## 📥 Load a CSV

To load your own data, pass a CSV path or URL to `pd.read_csv()`. For testing, the cell below uses one of seaborn's built-in datasets instead, which seaborn downloads the first time you ask for it.

In [3]:
# Example with seaborn's Titanic dataset
import seaborn as sns
df = sns.load_dataset("titanic")
df.head()

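|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |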
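
For your own data, `pd.read_csv()` takes a local path, a URL, or any file-like object. Here is a minimal sketch that uses a small in-memory CSV so it runs without a file on disk. The `csv_text` contents and its column names are made up for illustration and are not part of this notebook's data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file or URL
csv_text = """name,age,city
Ana,34,Lisbon
Ben,,Oslo
Cara,29,
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head())
print(sample_df.shape)
```

Swapping `io.StringIO(csv_text)` for a path string such as `"my_file.csv"` (a placeholder name) should load from disk in the same way. Note that the empty fields come back as `NaN`, which is why `age` ends up as a float column.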


## 📊 Peek at the Data

In [4]:
# First and last few rows
print("First 5 rows:")
print(df.head())

print("\nLast 5 rows:")
print(df.tail())

First 5 rows:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  

Last 5 rows:
     survived  pclass     sex   age  sibsp  parch   fare embarked   class  \
886         0       2    male  27.0      0      0  13.00        S  Second   
887         1       1  female  19.0  

## 🧠 Understand Structure

In [5]:
# Dimensions and columns
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())

Shape: (891, 15)

Columns: ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']


## 🧼 Data Types & Missing Values

In [6]:
# Data types and null counts
print("\nInfo:")
df.info()

# How many missing values per column?
print("\nMissing values per column:")
print(df.isnull().sum())


Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

Missing values per column:
survived         0
pclass           0
sex   
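
Raw null counts are easier to compare as percentages. The sketch below shows one way to do that on a small made-up frame. The `toy` data is an assumption for illustration, with column names that only mirror the Titanic ones.

```python
import numpy as np
import pandas as pd

# Small made-up frame; the column names mirror the Titanic data
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives True/False per cell; the mean of booleans is the fraction of True
missing_pct = toy.isnull().mean() * 100

# Worst offenders first
print(missing_pct.sort_values(ascending=False))
```

On the real data, the same line would be `df.isnull().mean() * 100`.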

## 📈 Summary Statistics

In [7]:
df.describe(include="all")

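|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891 | 714.0 | 891.0 | 891.0 | 891.0 | 889 | 891 | 891 | 891 | 203 | 889 | 891 | 891 |
| unique | NaN | NaN | 2 | NaN | NaN | NaN | NaN | 3 | 3 | 3 | 2 | 7 | 3 | 2 | 2 |
| top | NaN | NaN | male | NaN | NaN | NaN | NaN | S | Third | man | True | C | Southampton | no | True |
| freq | NaN | NaN | 577 | NaN | NaN | NaN | NaN | 644 | 491 | 537 | 537 | 59 | 644 | 549 | 537 |
| mean | 0.383838 | 2.308642 | NaN | 29.699118 | 0.523008 | 0.381594 | 32.204208 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 0.486592 | 0.836071 | NaN | 14.526497 | 1.102743 | 0.806057 | 49.693429 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 0.0 | 1.0 | NaN | 0.42 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 0.0 | 2.0 | NaN | 20.125 | 0.0 | 0.0 | 7.9104 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 0.0 | 3.0 | NaN | 28.0 | 0.0 | 0.0 | 14.4542 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 1.0 | 3.0 | NaN | 38.0 | 1.0 | 0.0 | 31.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |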
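
The mix of filled and empty cells in an `include="all"` summary comes from pandas computing different statistics for numeric and text columns. Here is a small sketch of the difference, using a made-up `toy` frame (an assumption for illustration, not the Titanic data itself).

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 8.05],
})

numeric_summary = toy.describe()            # numeric columns only (the default)
text_summary = toy[["sex"]].describe()      # count, unique, top, freq
full_summary = toy.describe(include="all")  # both; non-applicable cells are NaN

print(numeric_summary)
print(text_summary)
print(full_summary)
```

If the combined table feels crowded, describing numeric and text columns separately, as in the first two calls, can be easier to read.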


## ✏️ Renaming Columns (Optional but fun)

In [8]:
# Rename 'sex' to 'gender' for clarity
df.rename(columns={"sex": "gender"}, inplace=True)
df.head(2)

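|   | survived | pclass | gender | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |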


---
## 🔍 Your Turn

1. Load a dataset of your choice (`pd.read_csv()` or `sns.load_dataset()`).
2. Print the first and last 5 rows.
3. Show `.info()` and `.describe()` results.
4. Print the column names. Rename one of them.

🎯 **Bonus:** What percentage of rows have *any* missing values?

```python
# HINT:
df.isnull().any(axis=1).mean() * 100  # percent of rows with any NaNs
```
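
To see how the hint works, here is the same expression broken into steps on a tiny made-up frame (the `toy` data is an assumption for illustration). Two of its four rows contain a missing value.

```python
import numpy as np
import pandas as pd

# Rows 1 and 2 each have one missing value; rows 0 and 3 are complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
})

# any(axis=1) collapses each row to a single True/False
rows_with_nan = toy.isnull().any(axis=1)

# The mean of a boolean Series is the fraction of True values
pct_rows_with_nan = rows_with_nan.mean() * 100
print(pct_rows_with_nan)  # 50.0
```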


In [19]:
# Load the dataset
bts_df = pd.read_csv("BTS_Members.csv", sep=',')

In [26]:
# Print first 5 rows
print("First 5 rows:")
print(bts_df.head(5))

First 5 rows:
          Name Stage Name                     Position  Birth Date  \
0  Kim Namjoon         RM         Leader / Main Rapper   9/12/1994   
1  Kim Seokjin        Jin            Vocalist / Visual   12/4/1992   
2   Min Yoongi       Suga                  Lead Rapper    3/9/1993   
3  Jung Hoseok     J-Hope         Main Dancer / Rapper   2/18/1994   
4   Park Jimin      Jimin  Main Dancer / Lead Vocalist  10/13/1995   

    Nationality  Debut Year    Solo Projects Instagram Handle  
0  South Korean        2013     Indigo; Mono           @rkive  
1  South Korean        2013    The Astronaut             @jin  
2  South Korean        2013       D-2; D-Day          @agustd  
3  South Korean        2013  Jack In The Box       @uarmyhope  
4  South Korean        2013             Face             @j.m  


In [29]:
# Print last 5 rows
print("Last 5 rows:")
print(bts_df.tail(5))

Last 5 rows:
            Name Stage Name                              Position  Birth Date  \
2     Min Yoongi       Suga                           Lead Rapper    3/9/1993   
3    Jung Hoseok     J-Hope                  Main Dancer / Rapper   2/18/1994   
4     Park Jimin      Jimin           Main Dancer / Lead Vocalist  10/13/1995   
5   Kim Taehyung          V                              Vocalist  12/30/1995   
6  Jeon Jungkook   Jungkook  Main Vocalist / Lead Dancer / Maknae    9/1/1997   

    Nationality  Debut Year    Solo Projects Instagram Handle  
2  South Korean        2013       D-2; D-Day          @agustd  
3  South Korean        2013  Jack In The Box       @uarmyhope  
4  South Korean        2013             Face             @j.m  
5  South Korean        2013          Layover             @thv  
6  South Korean        2013           Golden     @jungkook.97  


In [31]:
print("Info:\n")
bts_df.info()

Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Name              7 non-null      object
 1   Stage Name        7 non-null      object
 2   Position          7 non-null      object
 3   Birth Date        7 non-null      object
 4   Nationality       7 non-null      object
 5   Debut Year        7 non-null      int64 
 6   Solo Projects     7 non-null      object
 7   Instagram Handle  7 non-null      object
dtypes: int64(1), object(7)
memory usage: 580.0+ bytes
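
`Birth Date` shows up as `object`, meaning pandas is treating the dates as plain text. That is one of the "bad types" worth catching at this stage. Below is a minimal sketch of a possible fix, using three dates copied from the output above. It assumes the dates are in month/day/year order, which a value like `2/18/1994` suggests but the source file does not confirm.

```python
import pandas as pd

# Three birth dates copied from the output above, stored as plain strings
members = pd.DataFrame({
    "Stage Name": ["RM", "Jin", "Suga"],
    "Birth Date": ["9/12/1994", "12/4/1992", "3/9/1993"],
})

# Assumption: dates are month/day/year
members["Birth Date"] = pd.to_datetime(members["Birth Date"], format="%m/%d/%Y")

print(members.dtypes)
print(members["Birth Date"].dt.year.tolist())  # [1994, 1992, 1993]
```

Once the column is a real datetime, the `.dt` accessor gives you years, months, weekdays, and so on.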


In [33]:
# Describe results
bts_df.describe(include="all")

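|   | Name | Stage Name | Position | Birth Date | Nationality | Debut Year | Solo Projects | Instagram Handle |
|---|---|---|---|---|---|---|---|---|
| count | 7 | 7 | 7 | 7 | 7 | 7.0 | 7 | 7 |
| unique | 7 | 7 | 7 | 7 | 1 | NaN | 7 | 7 |
| top | Kim Namjoon | RM | Leader / Main Rapper | 9/12/1994 | South Korean | NaN | Indigo; Mono | @rkive |
| freq | 1 | 1 | 1 | 1 | 7 | NaN | 1 | 1 |
| mean | NaN | NaN | NaN | NaN | NaN | 2013.0 | NaN | NaN |
| std | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN |
| min | NaN | NaN | NaN | NaN | NaN | 2013.0 | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | 2013.0 | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | 2013.0 | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | 2013.0 | NaN | NaN |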


In [36]:
# Print the column names
print("Column names:\n", bts_df.columns)

Column names:
 Index(['Name', 'Stage Name', 'Position', 'Birth Date', 'Nationality',
       'Debut Year', 'Solo Projects', 'Instagram Handle'],
      dtype='object')


In [37]:
# Rename a column name
bts_df.rename(columns={"Debut Year": "Year of Debut"}, inplace=True)
bts_df.head(2)

|   | Name | Stage Name | Position | Birth Date | Nationality | Year of Debut | Solo Projects | Instagram Handle |
|---|---|---|---|---|---|---|---|---|
| 0 | Kim Namjoon | RM | Leader / Main Rapper | 9/12/1994 | South Korean | 2013 | Indigo; Mono | @rkive |
| 1 | Kim Seokjin | Jin | Vocalist / Visual | 12/4/1992 | South Korean | 2013 | The Astronaut | @jin |
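
Column names with spaces and capitals, like these, can be awkward to type. One option is to pass a function to `rename()` so that every label gets cleaned in one go. This is a sketch on an empty frame that reuses a few of the column names above. The snake_case convention is just one common choice, not something this dataset requires.

```python
import pandas as pd

# Empty frame that only carries a few of the column labels from above
members = pd.DataFrame(columns=["Name", "Stage Name", "Birth Date", "Year of Debut"])

# rename() also accepts a function that is applied to every column label.
# Without inplace=True it returns a new DataFrame and leaves the original alone.
tidy = members.rename(columns=lambda label: label.strip().lower().replace(" ", "_"))

print(tidy.columns.tolist())
# ['name', 'stage_name', 'birth_date', 'year_of_debut']
```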


In [39]:
# What percentage of rows have any missing values?
print("Percent of rows with any missing values:")
print(bts_df.isnull().any(axis=1).mean() * 100)

Percent of rows with any missing values:
0.0


---
## 🎓 Why This Matters

Every data science project begins here — importing and inspecting data. If your data is a mess (spoiler: it always is), you need to know how to check it, clean it, and prep it.

Next up: slicing and dicing — using `.loc[]`, `.iloc[]`, and boolean masks to select what you want and ignore what you don't.