# Module 04 - Tidy data 
 ## What is Tidy Data?

 Tidy data is a structured format where:

     - Each row represents one observation (e.g., a country in a given year).
     - Each column is a variable (e.g., GDP, life expectancy).
     - Each table holds one kind of observation (e.g., country-level economic statistics).

 💡 Why use tidy data?
 
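     - Easier to analyze: works directly with `groupby()`, `agg()`, and plotting libraries such as Seaborn.
     - More readable: each variable lives in exactly one column, so no values are hidden in column headers.
     - Consistent: the same Pandas reshaping, filtering, and merging tools apply to every tidy table.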

In [14]:
import pandas as pd

 ## Wide Format to Tidy (Long) Format

 In the dataset below, **each year's population** is in a **separate column**, which makes it a **wide format**.

 We can convert it to **tidy** format using `pd.melt()`.

In [15]:
# population over time
df = pd.DataFrame({
    "country": ["USA", "Canada", "Brazil"],
    "1990": [253, 28, 149],
    "2000": [282, 31, 170],
    "2010": [309, 34, 192],
    "2020": [339, 38, 209],
    "continent": ["North America", "North America", "South America"],
})

# Wide format
display(df)


,country,1990,2000,2010,2020,continent
0,USA,253,282,309,339,North America
1,Canada,28,31,34,38,North America
2,Brazil,149,170,192,209,South America


 ### `pd.melt()`

 - `id_vars`: The columns that **stay the same** (identifiers).

 - `var_name`: Name of the **new column** that will hold the old column headers (years).

 - `value_name`: Name of the new column that will store the values (population in this case).
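
 As a small aside, `melt()` also accepts `value_vars`, a parameter this module does not otherwise use, to restrict which columns are unpivoted. The sketch below assumes a hypothetical two-country frame `df_demo` modeled on the table above; columns listed in neither `id_vars` nor `value_vars` are dropped from the result.

```python
import pandas as pd

# Hypothetical two-country subset of the population table above
df_demo = pd.DataFrame({
    "country": ["USA", "Canada"],
    "1990": [253, 28],
    "2000": [282, 31],
    "2010": [309, 34],
})

# value_vars limits the melt to the listed columns; "2010" is left out entirely
df_partial = df_demo.melt(
    id_vars=["country"],
    value_vars=["1990", "2000"],
    var_name="year",
    value_name="population",
)
print(df_partial)
```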

In [16]:
df_tidy = df.melt(id_vars=["country","continent"], var_name="year", value_name="population")
display(df_tidy)


,country,continent,year,population
0,USA,North America,1990,253
1,Canada,North America,1990,28
2,Brazil,South America,1990,149
3,USA,North America,2000,282
4,Canada,North America,2000,31
5,Brazil,South America,2000,170
6,USA,North America,2010,309
7,Canada,North America,2010,34
8,Brazil,South America,2010,192
9,USA,North America,2020,339
10,Canada,North America,2020,38
11,Brazil,South America,2020,209


 Notice how each row now represents **one country** in **one year**, and each column is **a single variable**.
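
 One detail worth knowing: after `melt()`, the `year` column holds the old column labels, which were strings. The rest of this module keeps them as strings (for example, it compares against `"2020"`), but numeric comparisons need a conversion first. A minimal sketch, assuming a hypothetical frame `tidy_demo` built the same way as `df_tidy`:

```python
import pandas as pd

# Hypothetical wide table, melted the same way as in the cell above
df_wide_demo = pd.DataFrame({
    "country": ["USA", "Canada"],
    "1990": [253, 28],
    "2000": [282, 31],
})
tidy_demo = df_wide_demo.melt(id_vars=["country"], var_name="year", value_name="population")

# The year values are strings because they came from column labels
print(tidy_demo["year"].tolist())

# Convert to integers so numeric comparisons behave as expected
tidy_int = tidy_demo.assign(year=tidy_demo["year"].astype(int))
print(tidy_int[tidy_int["year"] >= 2000])
```

 The conversion is written to a new variable here so that the string-based comparisons later in the module keep working.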

 ## Converting Tidy (Long) Format Back to Wide Format

 - If you ever need to go back to **wide** format, you can use `pivot()` or `pivot_table()`.

In [17]:
df_wide = df_tidy.pivot(index="country", columns="year", values="population")
display(df_wide)


year,1990,2000,2010,2020
country,,,,
Brazil,149,170,192,209
Canada,28,31,34,38
USA,253,282,309,339


 Here, each row is a **country** and each column is a **year**, so we are back to wide format. Note that `continent` is gone, because it was not passed to `pivot()`.
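
 The following sketch illustrates two variations on the pivot above, using a hypothetical frame `tidy_demo`: passing a list as `index` keeps extra identifier columns such as `continent`, and `pivot_table()` aggregates duplicate (index, column) pairs where `pivot()` would raise a `ValueError`. The choice of `aggfunc="mean"` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical tidy table with the same columns as df_tidy
tidy_demo = pd.DataFrame({
    "country": ["USA", "USA", "Brazil", "Brazil"],
    "continent": ["North America", "North America", "South America", "South America"],
    "year": ["1990", "2000", "1990", "2000"],
    "population": [253, 282, 149, 170],
})

# Passing a list as index keeps "continent" alongside "country"
wide_demo = (
    tidy_demo.pivot(index=["country", "continent"], columns="year", values="population")
    .reset_index()
)
wide_demo.columns.name = None  # drop the leftover "year" label on the column axis
print(wide_demo)

# Add a duplicate of the first row: pivot() would now raise ValueError,
# while pivot_table() aggregates the duplicates instead
dupes = pd.concat([tidy_demo, tidy_demo.iloc[[0]]], ignore_index=True)
wide_from_dupes = dupes.pivot_table(
    index="country", columns="year", values="population", aggfunc="mean"
)
print(wide_from_dupes)
```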

 ## Summarizing Tidy Data with `groupby()`

 Tidy data makes it straightforward to **group** and **summarize**.



 ### `groupby("year")["population"].mean()`

 This computes the **mean population** for each year across **all countries**.

In [18]:
df_year_mean = df_tidy.groupby("year")["population"].mean()
display(df_year_mean)

year
1990    143.333333
2000    161.000000
2010    178.333333
2020    195.333333
Name: population, dtype: float64

In [19]:
# Iterating over a groupby yields (key, sub-DataFrame) pairs, one per year
for key, data in df_tidy.groupby("year"):
    display(key)
    display(data)

'1990'

,country,continent,year,population
0,USA,North America,1990,253
1,Canada,North America,1990,28
2,Brazil,South America,1990,149


'2000'

,country,continent,year,population
3,USA,North America,2000,282
4,Canada,North America,2000,31
5,Brazil,South America,2000,170


'2010'

,country,continent,year,population
6,USA,North America,2010,309
7,Canada,North America,2010,34
8,Brazil,South America,2010,192


'2020'

,country,continent,year,population
9,USA,North America,2020,339
10,Canada,North America,2020,38
11,Brazil,South America,2020,209


 ### Grouping by Multiple Columns

 We can also group by **both** `year` **and** `country`.

In [20]:
df_year_country_sum = df_tidy.groupby(["year", "country"])["population"].sum()
display(df_year_country_sum)


year  country
1990  Brazil     149
      Canada      28
      USA        253
2000  Brazil     170
      Canada      31
      USA        282
2010  Brazil     192
      Canada      34
      USA        309
2020  Brazil     209
      Canada      38
      USA        339
Name: population, dtype: int64

 This returns a **multi-index Series**, showing the population **by year and by country**.
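
 A multi-index Series is often easier to work with once it is flattened or reshaped. The sketch below, on a hypothetical frame `tidy_demo`, shows two common options: `reset_index()` to return to a flat tidy table, and `unstack()` to spread one index level into columns.

```python
import pandas as pd

# Hypothetical tidy table with the same layout as df_tidy
tidy_demo = pd.DataFrame({
    "country": ["USA", "Canada", "USA", "Canada"],
    "year": ["1990", "1990", "2000", "2000"],
    "population": [253, 28, 282, 31],
})

grouped = tidy_demo.groupby(["year", "country"])["population"].sum()

# reset_index() moves both index levels back into ordinary columns
flat = grouped.reset_index()
print(flat)

# unstack() pivots the "country" level into columns, one row per year
by_country = grouped.unstack("country")
print(by_country)
```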

 ## `agg()` for Multiple Summaries

 The `agg()` function lets us apply **multiple aggregations** at once.

 For instance, we can compute the **mean**, **max**, **sum**, and **median** population per year.

In [21]:
df_agg = df_tidy.groupby("year").agg({"population": ["mean", "max","sum","median"]})
display(df_agg)


,population,population,population,population
,mean,max,sum,median
year,,,,
1990,143.333333,253,430,149.0
2000,161.0,282,483,170.0
2010,178.333333,309,535,192.0
2020,195.333333,339,586,209.0


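 This shows the average (`mean`), maximum (`max`), total (`sum`), and `median` population in each **year**. Because several functions are applied to one column, the result has two levels of column labels.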
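
 If the two-level column labels are inconvenient, pandas' named aggregation syntax produces flat column names instead. A minimal sketch on a hypothetical frame `tidy_demo`; the output names `mean_pop`, `max_pop`, and `total_pop` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical tidy table with the same layout as df_tidy
tidy_demo = pd.DataFrame({
    "country": ["USA", "Canada", "USA", "Canada"],
    "year": ["1990", "1990", "2000", "2000"],
    "population": [253, 28, 282, 31],
})

# Each keyword is an output column name; each value is (input column, function)
summary = tidy_demo.groupby("year").agg(
    mean_pop=("population", "mean"),
    max_pop=("population", "max"),
    total_pop=("population", "sum"),
)
print(summary)
```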

 ## Handling Missing Data

 Let's introduce some **missing values** to demonstrate `dropna()` and `fillna()`.

In [22]:
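# Create a copy with an artificially introduced NaN
df_missing = df_tidy.copy()
# Cast to float first so the column can hold NaN
df_missing["population"] = df_missing["population"].astype(float)
df_missing.loc[(df_missing["country"] == "Brazil") & (df_missing["year"] == "2020"), "population"] = None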

display(df_missing)


,country,continent,year,population
0,USA,North America,1990,253.0
1,Canada,North America,1990,28.0
2,Brazil,South America,1990,149.0
3,USA,North America,2000,282.0
4,Canada,North America,2000,31.0
5,Brazil,South America,2000,170.0
6,USA,North America,2010,309.0
7,Canada,North America,2010,34.0
8,Brazil,South America,2010,192.0
9,USA,North America,2020,339.0
10,Canada,North America,2020,38.0
11,Brazil,South America,2020,


 ### `dropna()`

 - **Removes** rows with missing values.

In [23]:
df_dropped = df_missing.dropna(subset=["population"])
display(df_dropped)


,country,continent,year,population
0,USA,North America,1990,253.0
1,Canada,North America,1990,28.0
2,Brazil,South America,1990,149.0
3,USA,North America,2000,282.0
4,Canada,North America,2000,31.0
5,Brazil,South America,2000,170.0
6,USA,North America,2010,309.0
7,Canada,North America,2010,34.0
8,Brazil,South America,2010,192.0
9,USA,North America,2020,339.0
10,Canada,North America,2020,38.0


 Brazil's 2020 row is **completely removed** because of the missing population.



 ### `fillna()`

 - **Fills** missing values with a specified value (here `0`). The related methods `ffill()` and `bfill()` propagate neighboring values instead.

In [24]:
df_filled = df_missing.fillna(0)
display(df_filled)


,country,continent,year,population
0,USA,North America,1990,253.0
1,Canada,North America,1990,28.0
2,Brazil,South America,1990,149.0
3,USA,North America,2000,282.0
4,Canada,North America,2000,31.0
5,Brazil,South America,2000,170.0
6,USA,North America,2010,309.0
7,Canada,North America,2010,34.0
8,Brazil,South America,2010,192.0
9,USA,North America,2020,339.0
10,Canada,North America,2020,38.0
11,Brazil,South America,2020,0.0


 Now, the missing value is replaced with `0`.
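
 Filling with `0` is simple, but a zero population is then treated as a real measurement in later summaries. One possible alternative, sketched below on a hypothetical frame `missing_demo`, is to forward-fill within each country so that a missing year takes that country's previous value. Whether that is appropriate depends on the data; it is shown only to compare the effect on a yearly mean.

```python
import numpy as np
import pandas as pd

# Hypothetical tidy table with one missing population value
missing_demo = pd.DataFrame({
    "country": ["USA", "USA", "Brazil", "Brazil"],
    "year": ["2010", "2020", "2010", "2020"],
    "population": [309.0, 339.0, 192.0, np.nan],
})

# Forward-fill within each country so Brazil 2020 takes Brazil's 2010 value
ordered = missing_demo.sort_values(["country", "year"])
filled = ordered.assign(population=ordered.groupby("country")["population"].ffill())

# Compare the yearly mean under the two strategies
mean_zero = missing_demo.fillna({"population": 0}).groupby("year")["population"].mean()
mean_ffill = filled.groupby("year")["population"].mean()
print(mean_zero["2020"], mean_ffill["2020"])  # 169.5 265.5
```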

 ## Combining Data with `merge()`

 Often, you'll have **multiple DataFrames** that need to be joined.

 Below is an example for merging a **GDP** dataset with our **population** dataset.

In [25]:
gdp_data = pd.DataFrame({
    "country": ["USA", "Canada", "Brazil"],
    "year": ["2020", "2020", "2020"],
    "gdp": [21439, 1736, 1445],  # GDP in billions (fictitious or approximate)
})

# Merging on both country and year
df_merged = df_tidy.merge(gdp_data, on=["country", "year"], how="left")
display(df_merged)


,country,continent,year,population,gdp
0,USA,North America,1990,253,
1,Canada,North America,1990,28,
2,Brazil,South America,1990,149,
3,USA,North America,2000,282,
4,Canada,North America,2000,31,
5,Brazil,South America,2000,170,
6,USA,North America,2010,309,
7,Canada,North America,2010,34,
8,Brazil,South America,2010,192,
9,USA,North America,2020,339,21439.0
10,Canada,North America,2020,38,1736.0
11,Brazil,South America,2020,209,1445.0


 We used `how="left"` so that **all rows from `df_tidy`** are preserved; rows with no match in `gdp_data` get `NaN` in the `gdp` column.



 - `how="inner"` would only keep matching rows.

 - `how="outer"` keeps **all** rows from both DataFrames.
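
 The three join types can be compared side by side. The sketch below uses two small hypothetical frames, `pop_demo` and `gdp_demo`, in which some keys appear on only one side. It also passes `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical population and GDP tables with partially overlapping keys
pop_demo = pd.DataFrame({
    "country": ["USA", "USA", "Brazil"],
    "year": ["2010", "2020", "2020"],
    "population": [309, 339, 209],
})
gdp_demo = pd.DataFrame({
    "country": ["USA", "Canada"],
    "year": ["2020", "2020"],
    "gdp": [21439, 1736],
})

keys = ["country", "year"]
left = pop_demo.merge(gdp_demo, on=keys, how="left")    # every row of pop_demo
inner = pop_demo.merge(gdp_demo, on=keys, how="inner")  # only keys present in both
outer = pop_demo.merge(gdp_demo, on=keys, how="outer", indicator=True)  # all keys

print(len(left), len(inner), len(outer))  # 3 1 4
print(outer)
```

 In the outer result, `_merge` is `both` for USA 2020, `left_only` for the rows found only in `pop_demo`, and `right_only` for Canada 2020.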

 ## Example: `sort_values()` and `query()`

 Tidy data also makes it easy to **sort** and **filter**.

In [26]:
# Sort by population descending
df_sorted = df_tidy.sort_values("population", ascending=False)
display(df_sorted)


,country,continent,year,population
9,USA,North America,2020,339
6,USA,North America,2010,309
3,USA,North America,2000,282
0,USA,North America,1990,253
11,Brazil,South America,2020,209
8,Brazil,South America,2010,192
5,Brazil,South America,2000,170
2,Brazil,South America,1990,149
10,Canada,North America,2020,38
7,Canada,North America,2010,34
4,Canada,North America,2000,31
1,Canada,North America,1990,28


 ### `query()`

 An alternative way to filter rows:



 ```python
 df_tidy.query("population > 200 and country == 'USA'")
 ```

 is equivalent to

 ```python
 df_tidy[(df_tidy["population"] > 200) & (df_tidy["country"] == "USA")]
 ```

In [27]:
df_filtered = df_tidy.query("population > 300 and country == 'USA'")
display(df_filtered)


,country,continent,year,population
6,USA,North America,2010,309
9,USA,North America,2020,339
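
 Query strings do not have to hard-code their values. As a brief sketch, `query()` can reference Python variables from the surrounding scope with an `@` prefix; the frame `tidy_demo` and the variables `min_pop` and `wanted` below are hypothetical.

```python
import pandas as pd

# Hypothetical tidy table with the same layout as df_tidy
tidy_demo = pd.DataFrame({
    "country": ["USA", "USA", "Canada"],
    "year": ["2010", "2020", "2020"],
    "population": [309, 339, 38],
})

min_pop = 300                 # hypothetical threshold
wanted = ["USA", "Canada"]    # hypothetical list of countries to keep

# Inside query(), @name refers to a Python variable in the surrounding scope
result = tidy_demo.query("population > @min_pop and country in @wanted")
print(result)
```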
