<a href="https://colab.research.google.com/github/jhlopesalves/Datacamp-Notebooks/blob/main/Python/Data%20Manipulation%20with%20pandas/data_manipulation_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Data Manipulation with Pandas

**Pandas** (Python Data Analysis Library) is one of the most powerful and popular libraries for data manipulation and analysis in Python. Built on top of NumPy, pandas provides high-performance, easy-to-use data structures and data analysis tools that make working with structured data both intuitive and efficient.

## Why Pandas is Essential

Pandas has become the **backbone of data science** in Python for several compelling reasons:

### **Powerful Data Structures**
- **DataFrame**: 2-dimensional labeled data structure (think Excel spreadsheet with superpowers)
- **Series**: 1-dimensional labeled array (a single column of data)

### **Versatile Data Handling**
- Read from multiple file formats (CSV, Excel, JSON, SQL databases, and more)
- Handle missing data gracefully
- Perform complex data transformations with simple, readable code
- Group, aggregate, and pivot data effortlessly

## The Foundation of Data Science

Pandas serves as the **bridge between raw data and insights**. Whether you're cleaning messy datasets, performing exploratory data analysis, or preparing data for machine learning models, pandas provides the tools you need to transform data into actionable intelligence.

## Official Documentation

For comprehensive guides, API references, and advanced tutorials, visit the official pandas documentation:

**[https://pandas.pydata.org/docs/](https://pandas.pydata.org/docs/)**

The setup cells below import the required libraries, create a `data` directory if it does not already exist, and download the course datasets into it.



In [77]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os

In [78]:
# Create the data directory if it does not already exist
os.makedirs('data', exist_ok=True)

!wget -q https://raw.githubusercontent.com/jhlopesalves/Datacamp-Notebooks/refs/heads/main/Python/Data%20Manipulation%20with%20pandas/data/homelessness.csv -O data/homelessness.csv
!wget -q https://raw.githubusercontent.com/jhlopesalves/Datacamp-Notebooks/refs/heads/main/Python/Data%20Manipulation%20with%20pandas/data/sales_subset.csv -O data/sales_subset.csv
!wget -q https://raw.githubusercontent.com/jhlopesalves/Datacamp-Notebooks/refs/heads/main/Python/Data%20Manipulation%20with%20pandas/data/temperatures.csv -O data/temperatures.csv
!wget -q https://raw.githubusercontent.com/jhlopesalves/Datacamp-Notebooks/refs/heads/main/Python/Data%20Manipulation%20with%20pandas/data/avoplotto.pkl -O data/avoplotto.pkl

print("Files downloaded successfully")

Files downloaded successfully


### Inspecting a DataFrame
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

- `.head()` returns the first few rows (the “head” of the DataFrame).
- `.info()` shows information on each of the columns, such as the data type and the number of non-null (non-missing) values.
- `.shape` returns the number of rows and columns of the DataFrame.
- `.describe()` calculates a few summary statistics for each numeric column.

`homelessness` is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The `individuals` column is the number of homeless individuals not part of a family with children. The `family_members` column is the number of homeless individuals part of a family with children. The `state_pop` column is the state's total population.

In [79]:
homelessness = pd.read_csv("data/homelessness.csv")

In [80]:
# Print the head of the homelessness DataFrame
homelessness.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
3,3,West South Central,Arkansas,2280.0,432.0,3009733
4,4,Pacific,California,109008.0,20964.0,39461588


In [81]:
# Print information about the column types and missing values in homelessness
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      51 non-null     int64  
 1   region          51 non-null     object 
 2   state           51 non-null     object 
 3   individuals     51 non-null     float64
 4   family_members  51 non-null     float64
 5   state_pop       51 non-null     int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 2.5+ KB


In [82]:
# Print the shape of homelessness
homelessness.shape

(51, 6)

In [83]:
# Print some summary statistics that describe the homelessness DataFrame
homelessness.describe()

Unnamed: 0.1,Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0,51.0
mean,25.0,7225.784314,3504.882353,6405637.0
std,14.866069,15991.025083,7805.411811,7327258.0
min,0.0,434.0,75.0,577601.0
25%,12.5,1446.5,592.0,1777414.0
50%,25.0,3082.0,1482.0,4461153.0
75%,37.5,6781.5,3196.0,7340946.0
max,50.0,109008.0,52070.0,39461590.0


### Parts of a DataFrame
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

- `.values`: A two-dimensional NumPy array of values.
- `.columns`: An index of columns: the column names.
- `.index`: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)

In [84]:
# Print a 2D NumPy array of the values in homelessness
display(homelessness.values)

array([[0, 'East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       [1, 'Pacific', 'Alaska', 1434.0, 582.0, 735139],
       [2, 'Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       [3, 'West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       [4, 'Pacific', 'California', 109008.0, 20964.0, 39461588],
       [5, 'Mountain', 'Colorado', 7607.0, 3250.0, 5691287],
       [6, 'New England', 'Connecticut', 2280.0, 1696.0, 3571520],
       [7, 'South Atlantic', 'Delaware', 708.0, 374.0, 965479],
       [8, 'South Atlantic', 'District of Columbia', 3770.0, 3134.0,
        701547],
       [9, 'South Atlantic', 'Florida', 21443.0, 9587.0, 21244317],
       [10, 'South Atlantic', 'Georgia', 6943.0, 2556.0, 10511131],
       [11, 'Pacific', 'Hawaii', 4131.0, 2399.0, 1420593],
       [12, 'Mountain', 'Idaho', 1297.0, 715.0, 1750536],
       [13, 'East North Central', 'Illinois', 6752.0, 3891.0, 12723071],
       [14, 'East North Central', 'Indiana', 3776.0, 1482.0, 6695497],
    

In [85]:
# Print the column index of homelessness
print(homelessness.columns)

Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members',
       'state_pop'],
      dtype='object')


In [86]:
# Print the row index of homelessness
print(homelessness.index)

RangeIndex(start=0, stop=51, step=1)


### Sorting and Subsetting DataFrames

DataFrames often contain a large volume of information, and effectively analyzing this data requires the ability to organize and filter it. **Sorting** allows you to reorder the rows of a DataFrame based on the values in one or more columns, while **subsetting** enables you to select specific rows and/or columns of interest. These operations are fundamental to data exploration and preparation.

#### Sorting Rows

To sort a DataFrame by one or more columns, you can use the `.sort_values()` method. This method returns a new DataFrame with the rows sorted according to the specified column(s).

- **Sorting by a single column**:
  By default, `.sort_values()` sorts in ascending order. To sort in descending order, set the `ascending` argument to `False`.
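
As a minimal sketch of both cases, the example below sorts a small, made-up `pets` DataFrame (the data and column names are assumptions for illustration only). It sorts first by one column in descending order, then by two columns with a separate direction for each.

```python
import pandas as pd

# Hypothetical toy data, not taken from the course datasets
pets = pd.DataFrame({
    "name": ["Rex", "Milo", "Luna", "Bella"],
    "breed": ["Poodle", "Labrador", "Poodle", "Labrador"],
    "weight_kg": [29, 24, 31, 17],
})

# Single column, heaviest first
by_weight = pets.sort_values("weight_kg", ascending=False)

# Several columns: breed A-Z, then heaviest first within each breed
by_breed_weight = pets.sort_values(
    ["breed", "weight_kg"], ascending=[True, False]
)

print(by_weight["name"].tolist())        # ['Luna', 'Rex', 'Milo', 'Bella']
print(by_breed_weight["name"].tolist())  # ['Milo', 'Bella', 'Luna', 'Rex']
```

`pets` itself keeps its original row order, because `.sort_values()` returns a new DataFrame.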

#### Subsetting Data in Pandas

Subsetting is the process of extracting specific **rows**, **columns**, or both from a DataFrame. This is crucial for data inspection, filtering observations, feature selection, and modular analysis.

Pandas offers several methods for subsetting, grouped below into five strategies:

### 1. Subsetting Columns

Selecting columns is straightforward and crucial for isolating relevant variables.

- **Select a single column** (returns a `Series`):

```python
df["column_name"]
```

- **Select multiple columns** (returns a `DataFrame`):

```python
df[["col1", "col2", "col3"]]
```

This can be assigned to a new variable for reuse:

```python
features = df[["height", "weight", "bmi"]]
```


### 2. Subsetting Rows

There are two primary methods for selecting rows: by **position** and by **label**.

#### a) Using `.iloc` (Integer-location based)

`.iloc` selects rows by integer position (zero-indexed):

```python
df.iloc[0]         # First row (as Series)
df.iloc[0:5]       # First five rows (as DataFrame)
df.iloc[[1, 3, 5]] # Specific row positions
```

#### b) Using `.loc` (Label-based)

`.loc` selects rows by index **label** (not numerical position). Note that slicing is **inclusive** on both ends:

```python
df.loc["row_label"]
df.loc[["row1", "row3", "row5"]]
df.loc["label1":"label4"]  # Includes both endpoints
```

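If the DataFrame uses a RangeIndex (i.e., 0, 1, 2...), `.loc[0]` and `.iloc[0]` return the same row, but they ask different questions: `.loc[0]` looks up the row labelled `0`, while `.iloc[0]` takes whichever row comes first. The two diverge as soon as rows are sorted or filtered without resetting the index.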
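
The distinction between labels and positions is easiest to see on a tiny example. The sketch below uses a hypothetical `scores` DataFrame (names and values are made up) and compares `.loc` and `.iloc` after sorting, and when slicing.

```python
import pandas as pd

# Hypothetical toy data with the default RangeIndex (labels 0, 1, 2)
scores = pd.DataFrame({"player": ["Ana", "Ben", "Cy"], "points": [10, 30, 20]})

# Sorting reorders the rows but each row keeps its original label
ranked = scores.sort_values("points", ascending=False)

by_label = ranked.loc[0]      # the row whose index label is 0
by_position = ranked.iloc[0]  # the row in the first position

print(by_label["player"])     # Ana
print(by_position["player"])  # Ben

# Slices differ too: .loc includes the end label, .iloc excludes the end position
print(len(scores.loc[0:1]))   # 2
print(len(scores.iloc[0:1]))  # 1
```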


### 3. Conditional Subsetting (Boolean Indexing)

This is one of the most powerful tools in the pandas arsenal and central to exploratory data analysis.

#### Basic condition:

```python
df[df["Age"] > 30]
```

This returns all rows where the value in the `Age` column exceeds 30.

#### Combining conditions:

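Use **bitwise logical operators** rather than `and`/`or`, and wrap each condition in **parentheses**, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`: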

```python
df[(df["Age"] > 30) & (df["Department"] == "Engineering")]
```

- `&` for logical AND
- `|` for logical OR
- `~` for logical NOT

Example of negation:

```python
df[~(df["Status"] == "Inactive")]
```

#### Assigning boolean masks:

To improve clarity and reuse logic:

```python
is_engineer = df["Department"] == "Engineering"
high_salary = df["Salary"] > 100000

df[is_engineer & high_salary]
```

This separates the logic into expressive components.


### 4. Subsetting Rows and Columns Simultaneously

You may use `.loc` to filter rows and columns **at once**:

```python
df.loc[df["Age"] > 30, ["Name", "Salary"]]
```

This returns a new DataFrame where:

- Rows: satisfy the condition `Age > 30`
- Columns: limited to `"Name"` and `"Salary"`

This is highly idiomatic in data cleaning pipelines.

### 5. Subsetting Using `.isin()` (Membership-Based Filtering)

`.isin()` is particularly useful when you need to filter for **multiple matches** in a categorical column, such as selecting all rows whose values belong to a predefined list.

#### Example:

```python
is_black_or_brown = df["Color"].isin(["Black", "Brown"])
df[is_black_or_brown]
```

This selects all rows where the `Color` column is either `"Black"` or `"Brown"`.

#### Alternative inline version:

```python
df[df["Color"].isin(["Black", "Brown"])]
```

This approach is ideal for filtering based on categorical membership — such as departments, statuses, genres, or tags.



#### Best Practices

- When writing complex filters, store boolean masks in variables to improve readability and testability.
- Combine `.loc[]` with conditional subsetting to avoid chained indexing.
- Always verify results with `.head()`, `.shape`, or `.info()` when filtering to ensure logical correctness.
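
The advice about `.loc[]` and chained indexing can be sketched as follows. The `staff` DataFrame and its columns are hypothetical, chosen only to illustrate the pattern.

```python
import pandas as pd

# Hypothetical toy data
staff = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [25, 41, 37],
    "salary": [50_000, 90_000, 70_000],
})

# Store the mask in a variable so it can be reused and inspected
over_30 = staff["age"] > 30

# One .loc call selects rows and columns together
subset = staff.loc[over_30, ["name", "salary"]]

# The same form is safe for assignment, because it addresses `staff` directly
staff.loc[over_30, "salary"] = staff.loc[over_30, "salary"] + 5_000

# Chained form to avoid for assignment: staff[over_30]["salary"] = ...
# It may write to a temporary copy and leave `staff` unchanged.

print(subset.shape)              # (2, 2)
print(staff["salary"].tolist())  # [50000, 95000, 75000]
```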


In [87]:
# Sort homelessness by the number of homeless individuals in the individuals column, from smallest to largest, and save this as homelessness_ind. Print the head of the sorted DataFrame
homelessness_ind = homelessness.sort_values(by="individuals", ascending=True)
# Print the top few rows
homelessness_ind.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
50,50,Mountain,Wyoming,434.0,205.0,577601
34,34,West North Central,North Dakota,467.0,75.0,758080
7,7,South Atlantic,Delaware,708.0,374.0,965479
39,39,New England,Rhode Island,747.0,354.0,1058287
45,45,New England,Vermont,780.0,511.0,624358


In [88]:
# Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.
homelessness_fam = homelessness.sort_values(by="family_members", ascending=False)
homelessness_fam.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,4,Pacific,California,109008.0,20964.0,39461588
21,21,New England,Massachusetts,6811.0,13257.0,6882635
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
43,43,West South Central,Texas,19199.0,6111.0,28628666


In [89]:
# Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.
homelessness_reg_fam = homelessness.sort_values(by=["region", "family_members"], ascending=[True, False])
# Print the top few rows
homelessness_reg_fam.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
13,13,East North Central,Illinois,6752.0,3891.0,12723071
35,35,East North Central,Ohio,6929.0,3320.0,11676341
22,22,East North Central,Michigan,5209.0,3142.0,9984072
49,49,East North Central,Wisconsin,2740.0,2167.0,5807406
14,14,East North Central,Indiana,3776.0,1482.0,6695497


In [90]:
# Create a Series called individuals that contains only the individuals column of homelessness.
individuals = homelessness["individuals"]
display(individuals.head())

Unnamed: 0,individuals
0,2570.0
1,1434.0
2,7259.0
3,2280.0
4,109008.0


In [91]:
# Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.
state_fam = homelessness[["state", "family_members"]]
display(state_fam.head())

Unnamed: 0,state,family_members
0,Alabama,864.0
1,Alaska,582.0
2,Arizona,2606.0
3,Arkansas,432.0
4,California,20964.0


In [92]:
# Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.
ind_state = homelessness[["individuals", "state"]]
display(ind_state.head())

Unnamed: 0,individuals,state
0,2570.0,Alabama
1,1434.0,Alaska
2,7259.0,Arizona
3,2280.0,Arkansas
4,109008.0,California


In [93]:
# Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k.
ind_gt_10k = homelessness[(homelessness["individuals"] > 10_000)]
# See the result
display(ind_gt_10k.head())

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
4,4,Pacific,California,109008.0,20964.0,39461588
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,37,Pacific,Oregon,11139.0,3337.0,4181886
43,43,West South Central,Texas,19199.0,6111.0,28628666


In [94]:
# Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg.
mountain_reg = homelessness[(homelessness["region"] == "Mountain")]
# See the result
display(mountain_reg.head())

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
2,2,Mountain,Arizona,7259.0,2606.0,7158024
5,5,Mountain,Colorado,7607.0,3250.0,5691287
12,12,Mountain,Idaho,1297.0,715.0,1750536
26,26,Mountain,Montana,983.0,422.0,1060665
28,28,Mountain,Nevada,7058.0,486.0,3027341


In [95]:
# Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]
# See the result
display(fam_lt_1k_pac.head())

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
1,1,Pacific,Alaska,1434.0,582.0,735139


In [96]:
# Filter homelessness for rows where the state is one of the Mojave Desert states listed in canu, assigning to mojave_homelessness
canu = ["California", "Arizona", "Nevada", "Utah"]
# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]
# See the result
display(mojave_homelessness.head())

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
2,2,Mountain,Arizona,7259.0,2606.0,7158024
4,4,Pacific,California,109008.0,20964.0,39461588
28,28,Mountain,Nevada,7058.0,486.0,3027341
44,44,Mountain,Utah,1904.0,972.0,3153550


## Adding New Columns to a DataFrame

The capacity to create new columns is fundamental to any serious data analysis workflow. Whether you are engineering features, deriving metrics, or flagging data for quality checks, Pandas enables you to add columns efficiently and flexibly.


### 1. Adding Columns by Assignment

The most idiomatic way to add a new column is by simple assignment, using the column name as a key.

#### Example: Add a column based on existing data

```python
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)
```

This adds a new column called `bmi`, computed from existing columns.


### 2. Adding Columns with a Scalar Value

If you wish to assign the same value to all rows:

```python
df["country"] = "Brazil"
```

Every row now has `"Brazil"` in the `country` column.


### 3. Adding Columns from a List, Array, or Series

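You may assign a list, NumPy array, or Pandas Series. A list or array must match the DataFrame’s length:

```python
df["random_noise"] = np.random.normal(size=len(df))
```

If the length does not match, Pandas will raise a `ValueError`. A Series behaves differently: it is aligned on its index labels, so rows without a matching label receive `NaN` instead of triggering an error.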
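
A short sketch, using a made-up three-row DataFrame, contrasts what happens with a list of the wrong length and with a Series that covers only some of the row labels. All names here are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data with string row labels
df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

# A list of the wrong length is rejected
try:
    df["too_short"] = [10, 20]
except ValueError as err:
    print("ValueError:", err)

# A Series is matched on index labels, not on position or length
partial = pd.Series({"c": 30, "a": 10})
df["from_series"] = partial

print(df)  # row "b" holds NaN in from_series
```

Because of this alignment, it is worth checking the index of any Series before assigning it as a column.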


### 4. Adding Columns with `assign()` (Functional Chaining)

For pipeline-oriented code, use `.assign()`, which returns a new DataFrame:

```python
df2 = df.assign(log_salary=np.log(df["salary"]))
```

Multiple columns can be added simultaneously:

```python
df2 = df.assign(
    salary_eur = df["salary"] * 0.92,
    has_bonus = df["bonus"].notna()
)
```

This is ideal for chaining several transformations without mutating the original object.
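
As a sketch of such a chain, the example below passes lambdas to `.assign()` so that a later step can use a column created by an earlier one. The `pay` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
pay = pd.DataFrame({"salary": [1000.0, 2000.0], "bonus": [100.0, 0.0]})

result = (
    pay
    .assign(total=lambda d: d["salary"] + d["bonus"])
    # The lambda receives the intermediate frame, so `total` is already available
    .assign(bonus_share=lambda d: d["bonus"] / d["total"])
    .sort_values("bonus_share", ascending=False)
)

print(result)
print(pay.columns.tolist())  # ['salary', 'bonus'], the original is untouched
```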


### 5. Adding Columns Conditionally

It is often necessary to add a column reflecting a condition:

```python
df["is_adult"] = df["age"] >= 18
```

The new column will be of boolean dtype (`True`/`False`). For categorical flags, you may use `.map()` or `np.where()`:

```python
df["age_group"] = np.where(df["age"] < 18, "child", "adult")
```
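
The `.map()` route mentioned above is not shown in the snippet, so here is a minimal sketch of it. It translates a boolean column into labels through a lookup dictionary; the `people` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
people = pd.DataFrame({"name": ["Ana", "Ben", "Cy"], "age": [12, 18, 45]})

people["is_adult"] = people["age"] >= 18

# Translate each boolean into a label with a lookup dictionary
people["age_group"] = people["is_adult"].map({True: "adult", False: "child"})

print(people["age_group"].tolist())  # ['child', 'adult', 'adult']
```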


### 6. Adding Columns from Functions

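You can also build new columns by combining existing ones. Simple cases, such as joining two text columns, need only vectorised operators:

```python
df["full_name"] = df["first_name"] + " " + df["last_name"]
```

For more complex logic, apply a function row-wise: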

```python
def bmi_category(row):
    if row["bmi"] < 18.5:
        return "Underweight"
    elif row["bmi"] < 25:
        return "Normal"
    else:
        return "Overweight"

df["bmi_category"] = df.apply(bmi_category, axis=1)
```

Note: `.apply(..., axis=1)` can be slow for large DataFrames; prefer vectorised methods whenever possible.
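
One possible vectorised alternative to the row-wise function is `pd.cut`, which bins a numeric column in a single call. This is a sketch under the assumption that the same three thresholds apply; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"bmi": [17.0, 18.5, 22.3, 25.0, 31.4]})

# right=False makes each bin include its left edge:
# [0, 18.5), [18.5, 25), [25, inf)
df["bmi_category"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, float("inf")],
    labels=["Underweight", "Normal", "Overweight"],
    right=False,
)

print(df)
```

One difference worth knowing: a missing `bmi` yields a missing category here, whereas the row-wise function would fall through to `"Overweight"`.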


### Best Practices

- Prefer vectorised operations for performance and clarity.
- Use `.assign()` in pipelines for non-destructive transformations.
- Always check column alignment when assigning from external sequences.
- When overwriting columns, be explicit and cautious to avoid accidental data loss.


In [97]:
# Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

In [98]:
# Add another column to homelessness, named p_homeless, containing the proportion of each state's total population (state_pop) that is homeless.
homelessness["p_homeless"] = homelessness["total"] / homelessness["state_pop"]

In [99]:
# See the result
display(homelessness.head())

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_homeless
0,0,East South Central,Alabama,2570.0,864.0,4887681,3434.0,0.000703
1,1,Pacific,Alaska,1434.0,582.0,735139,2016.0,0.002742
2,2,Mountain,Arizona,7259.0,2606.0,7158024,9865.0,0.001378
3,3,West South Central,Arkansas,2280.0,432.0,3009733,2712.0,0.000901
4,4,Pacific,California,109008.0,20964.0,39461588,129972.0,0.003294


In [100]:
# Add a column to homelessness, indiv_per_10k, containing the number of homeless individuals per ten thousand people in each state, using state_pop for state population.
homelessness["indiv_per_10k"] = 10_000 * homelessness["individuals"] / homelessness["state_pop"]
# Subset rows where indiv_per_10k is higher than 20, assigning to high_homelessness.
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]
# Sort high_homelessness by descending indiv_per_10k, assigning to high_homelessness_srt.
high_homelessness_srt = high_homelessness.sort_values(by="indiv_per_10k", ascending=False)
# Select only the state and indiv_per_10k columns of high_homelessness_srt and save as result.
result = high_homelessness_srt[["state", "indiv_per_10k"]]
# See the result
display(result)

Unnamed: 0,state,indiv_per_10k
8,District of Columbia,53.738381
11,Hawaii,29.079406
4,California,27.623825
37,Oregon,26.636307
28,Nevada,23.314189
47,Washington,21.829195
32,New York,20.392363


## Aggregating DataFrames

Aggregation is the process of collapsing one or more dimensions of a DataFrame into summary statistics. It lies at the core of data analysis, supporting trend identification, hypothesis testing, and the creation of meaningful summaries from granular data. Pandas offers a rich set of aggregation functions, both simple and cumulative, along with the highly flexible `.agg()` method for multi-statistic summarisation.

### 1. Basic Summary Statistics

#### Mean, Median, and Mode

- **Mean** (`.mean()`): Arithmetic average.
- **Median** (`.median()`): Middle value in ordered data.
- **Mode** (`.mode()`): Most frequent value(s).

```python
df["salary"].mean()
df["salary"].median()
df["salary"].mode()  # returns a Series, as there may be multiple modes
```

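- For DataFrames, calling these methods without selecting a column computes the statistic **column-wise**. Pass `numeric_only=True` to `.mean()` and `.median()` when the DataFrame also holds non-numeric columns, since recent pandas versions (2.0 and later) may raise a `TypeError` otherwise:

```python
df.mean(numeric_only=True)     # Mean of all numeric columns
df.median(numeric_only=True)   # Median of all numeric columns
df.mode()                      # Mode(s) for all columns (returns DataFrame)
```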

### 2. Summarising Dates: `min()` and `max()`

The `.min()` and `.max()` methods apply to both numeric and datetime columns, yielding the smallest and largest values, respectively.

```python
df["date"].min()  # Earliest date
df["date"].max()  # Latest date
```

### 3. The `.agg()` Method: Flexible Multi-Statistic Summarisation

`.agg()` allows you to compute multiple statistics simultaneously, for one or many columns.

#### Aggregating a single column:

```python
df["salary"].agg(["mean", "std", "min", "max"])
```

#### Aggregating multiple columns with different statistics:

```python
df.agg({
    "salary": ["mean", "median"],
    "age": ["min", "max"]
})
```

- The result is a DataFrame with one column per aggregated column and one row per statistic. A statistic that was not requested for a given column appears as `NaN`.
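
To see what this call returns, it can help to run it on a tiny DataFrame and inspect the result. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"salary": [40.0, 50.0, 90.0], "age": [25, 35, 45]})

summary = df.agg({"salary": ["mean", "median"], "age": ["min", "max"]})

# Rows are the statistic names, columns are the aggregated columns;
# cells for statistics that were not requested hold NaN
print(summary)
print(type(summary.index).__name__)
```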

#### Custom functions with `.agg()`:

You may pass user-defined functions, either as named functions or lambdas:

```python
df["salary"].agg(lambda s: s.quantile(0.9))
```

### 4. Aggregating Multiple Columns at Once

With `.agg()` you may specify different sets of statistics for each column:

```python
df.agg({
    "revenue": ["sum", "mean"],
    "expenses": ["sum", "mean"]
})
```

Alternatively, select the columns first and apply the same list of statistics to all of them:

```python
df[["revenue", "expenses"]].agg(["sum", "mean"])
```

### 5. Cumulative Statistics

Cumulative functions return running totals or extremes, revealing trends or thresholds over a sequence (often chronological).

#### a) Cumulative Sum: `.cumsum()`

Returns the running total:

```python
df["sales"].cumsum()
```

#### b) Cumulative Maximum: `.cummax()`

Tracks the largest value encountered up to each row:

```python
df["score"].cummax()
```

#### c) Cumulative Minimum: `.cummin()`

Tracks the smallest value encountered up to each row:

```python
df["score"].cummin()
```

#### d) Cumulative Product: `.cumprod()`

Returns the running product:

```python
df["growth_factor"].cumprod()
```

All these methods operate column-wise by default and can be used on multiple columns:

```python
df[["sales", "profit"]].cumsum()
```
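
A small worked example makes the running values concrete. The `weekly` DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
weekly = pd.DataFrame({"sales": [3, 1, 4, 1, 5]})

weekly["running_total"] = weekly["sales"].cumsum()
weekly["best_so_far"] = weekly["sales"].cummax()
weekly["worst_so_far"] = weekly["sales"].cummin()

print(weekly["running_total"].tolist())  # [3, 4, 8, 9, 14]
print(weekly["best_so_far"].tolist())    # [3, 3, 4, 4, 5]
print(weekly["worst_so_far"].tolist())   # [3, 1, 1, 1, 1]
```

Cumulative results depend on row order, so sort the DataFrame first (for example by date) and compute the cumulative column from that same sorted DataFrame.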

### Best Practices and Further Considerations

- Aggregations are most powerful when combined with **grouping** (see `groupby`) for stratified summaries.
- For descriptive overviews, use `.describe()` for numeric columns, and `.value_counts()` for categorical data.
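- Always inspect the shape and structure of the output, especially when using `.agg()`: a dictionary of lists leaves `NaN` wherever a statistic was not requested for a column, and grouped aggregations can return MultiIndex columns.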
- When working with large datasets, cumulative functions can reveal seasonality, trends, or threshold events more transparently than simple statistics.
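
The first two points can be sketched on a made-up `orders` DataFrame (names and values are assumptions for illustration): a grouped aggregation gives one summary row per group, and `.value_counts()` summarises a categorical column.

```python
import pandas as pd

# Hypothetical toy data
orders = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "revenue": [100, 80, 120, 60, 140],
})

# One summary row per region
per_region = orders.groupby("region")["revenue"].agg(["sum", "mean"])
print(per_region)

# How often each category occurs
region_counts = orders["region"].value_counts()
print(region_counts)
```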

In [101]:
sales = pd.read_csv("data/sales_subset.csv")

In [102]:
# Explore your new DataFrame first by printing the first few rows of the sales DataFrame.
print(sales.head())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  


In [103]:
# Print information about the columns in sales.
# .info() prints its report itself and returns None, so it is not wrapped in print().
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            10774 non-null  int64  
 1   store                 10774 non-null  int64  
 2   type                  10774 non-null  object 
 3   department            10774 non-null  int64  
 4   date                  10774 non-null  object 
 5   weekly_sales          10774 non-null  float64
 6   is_holiday            10774 non-null  bool   
 7   temperature_c         10774 non-null  float64
 8   fuel_price_usd_per_l  10774 non-null  float64
 9   unemployment          10774 non-null  float64
dtypes: bool(1), float64(4), int64(3), object(2)
memory usage: 768.2+ KB


In [104]:
# Print the mean of the weekly_sales column.
print(sales["weekly_sales"].mean())

23843.95014850566


In [105]:
# Print the median of the weekly_sales column.
print(sales["weekly_sales"].median())

12049.064999999999


### Summarizing dates
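Summary statistics can also be calculated on date columns, ideally stored with the data type `datetime64`. Some summary statistics, like the mean, don't make much sense on dates, but others are very helpful: the minimum and maximum show what time range your data covers. Note that `sales["date"]` was read as `object` (strings) in this notebook; because the dates use the ISO `YYYY-MM-DD` format, string comparison still returns the correct earliest and latest dates.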
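
If a date column has been read as strings, `pd.to_datetime` converts it to `datetime64`, which also makes date arithmetic possible. The sketch below uses a hypothetical three-row DataFrame, not the `sales` data itself.

```python
import pandas as pd

# Hypothetical toy data: dates stored as ISO-formatted strings
df = pd.DataFrame({"date": ["2010-03-05", "2010-02-05", "2012-10-26"]})

# Convert the strings to datetime64 values
df["date"] = pd.to_datetime(df["date"])

earliest = df["date"].min()
latest = df["date"].max()
span = latest - earliest  # a Timedelta, only possible with real datetimes

print(earliest.date(), latest.date(), span.days)
```

When loading a file, passing `parse_dates=["date"]` to `pd.read_csv` performs the same conversion at read time.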

In [106]:
# Print the maximum of the date column.
print(sales["date"].max())

2012-10-26


In [107]:
# Print the minimum of the date column.
print(sales["date"].min())

2010-02-05


### Efficient summaries
While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

The `.agg()` method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

```python
df['column'].agg(function)
```
In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.

In [108]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print IQR of the temperature_c column
print(sales["temperature_c"].agg(iqr))

16.583333333333336


In [109]:
# Update the column selection to use the custom iqr function with .agg() to print the IQR of temperature_c, fuel_price_usd_per_l, and unemployment, in that order.
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr))

temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64


In [110]:
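# Update the aggregation functions called by .agg(): include iqr and the median, in that order.
# The string "median" is used instead of np.median, which newer pandas versions may flag with a FutureWarning when passed to .agg().
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, "median"]))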

        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099


### Cumulative statistics
Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

In [111]:
# Assumption: sales_1_1 is meant to hold the rows for store 1, department 1; it is not defined anywhere earlier in this notebook.
sales_1_1 = sales[(sales["store"] == 1) & (sales["department"] == 1)].copy()
# Sort the rows of sales_1_1 by the date column in ascending order.
sales_1_1 = sales_1_1.sort_values("date")

In [112]:
# Get the cumulative sum of weekly_sales and add it as a new column of sales_1_1 called cum_weekly_sales.
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

In [113]:
# Get the cumulative maximum of weekly_sales, and add it as a column called cum_max_sales.
# Use sales_1_1 (not sales) so the running maximum follows the sorted rows.
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

In [114]:
# See the columns you calculated
display(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])


### Counting

Counting is an elemental operation in data exploration, supporting tasks such as deduplication, frequency analysis, and the estimation of proportions or rates. Accurate counting is fundamental to understanding dataset structure, identifying irregularities, and producing summary statistics for categorical and discrete variables. Pandas provides robust, highly-optimised methods for these operations.

### 1. Removing Duplicates: `drop_duplicates(subset=...)`

Duplicate entries can bias frequency analyses and inflate summary statistics. Use `.drop_duplicates()` to remove repeated rows.

- **Basic usage:**

```python
df_unique = df.drop_duplicates()
```

Removes fully duplicated rows.

- **Deduplicate by specific columns:**

```python
df_unique = df.drop_duplicates(subset=["name", "birthdate"])
```

This retains only the first occurrence of each unique combination of `name` and `birthdate`, removing any subsequent duplicates on these fields.

**Note:** By default, `.drop_duplicates()` preserves the first occurrence; set `keep='last'` to retain the last.
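
The effect of `keep` can be seen on a small, made-up `people` DataFrame in which one person appears twice with different cities (all values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data: Ana appears twice with different cities
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana"],
    "birthdate": ["1990-01-01", "1985-06-30", "1990-01-01"],
    "city": ["Lisbon", "Porto", "Faro"],
})

first_kept = people.drop_duplicates(subset=["name", "birthdate"])
last_kept = people.drop_duplicates(subset=["name", "birthdate"], keep="last")

print(first_kept["city"].tolist())  # ['Lisbon', 'Porto']
print(last_kept["city"].tolist())   # ['Porto', 'Faro']
```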

### 2. Counting Unique Values: `value_counts()`

The `.value_counts()` method is central for frequency analysis of a Series (i.e., a single column).

#### Basic frequency count:

```python
df["city"].value_counts()
```

Returns a Series with unique values in `"city"` as the index, and their frequencies as values, sorted in descending order by default.

#### Sorting Frequencies: `sort=True/False`

By default, `.value_counts()` sorts results in descending order of frequency (`sort=True`).

- **Sort in ascending order of frequency:**

```python
df["city"].value_counts(ascending=True)
```

- **Disable sorting (preserve appearance order):**

```python
df["city"].value_counts(sort=False)
```

#### Proportional Frequencies: `normalize=True`

For proportions (relative frequencies), set `normalize=True`:

```python
df["city"].value_counts(normalize=True)
```

Each value in the output will be the proportion of total observations (ranging from 0 to 1). This is essential for estimating prevalence, market shares, or demographic rates.

- **Combine options:**

```python
df["city"].value_counts(normalize=True, ascending=True)
```
### Best Practices and Considerations

- Use `drop_duplicates()` before frequency counts if repeated rows represent accidental duplication.
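- `.value_counts()` is most often called on a Series, as in `df["column"].value_counts()`; called on a DataFrame (pandas 1.1 and later), it counts unique rows.
- For more than one column, use `.groupby()` in conjunction with `.size()`, or call `.value_counts()` on the selected columns, for multi-dimensional frequency tables.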
- Frequencies (absolute and proportional) are crucial in EDA, quality assurance, and reporting for both categorical and discrete numerical variables.
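
As a closing sketch, the example below computes absolute and proportional frequencies for one column, then counts combinations of two columns with `.groupby()` and `.size()`. The `visits` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
visits = pd.DataFrame({
    "city": ["Rio", "Lima", "Rio", "Rio", "Lima"],
    "channel": ["web", "web", "app", "web", "app"],
})

counts = visits["city"].value_counts()                # absolute frequencies
shares = visits["city"].value_counts(normalize=True)  # proportions summing to 1

# Frequency table over two columns: one count per (city, channel) pair
pair_counts = visits.groupby(["city", "channel"]).size()

print(counts)
print(shares)
print(pair_counts)
```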
