In [1]:
import pandas as pd

# `pandas.DataFrame.fillna`

In [15]:
# Create a sample DataFrame with NaN values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 30, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,25.0,New York
1,Bob,,Los Angeles
2,Charlie,30.0,
3,David,,Chicago


In [18]:
# Fill NaN values
df_filled = df.fillna({
    'Age': df['Age'].mean(),     # Fill missing ages with the mean
    'City': 'Unknown'            # Fill missing cities with a placeholder
})
df_filled

Unnamed: 0,Name,Age,City
0,Alice,25.0,New York
1,Bob,27.5,Los Angeles
2,Charlie,30.0,Unknown
3,David,27.5,Chicago


# `pandas.DataFrame.drop`

In [2]:
df = pd.DataFrame([{'name': 'Asdas', 'age': 23}, {'name': 'Basdas', 'age': 22}, {'name': 'Casdas', 'age': 21}])
df

Unnamed: 0,name,age
0,Asdas,23
1,Basdas,22
2,Casdas,21


In [3]:
df.drop('age', axis=1)

Unnamed: 0,name
0,Asdas
1,Basdas
2,Casdas


In [4]:
df

Unnamed: 0,name,age
0,Asdas,23
1,Basdas,22
2,Casdas,21


In [5]:
df.drop('age', axis=1, inplace=True)

In [6]:
df

Unnamed: 0,name
0,Asdas
1,Basdas
2,Casdas


In [9]:
df.drop([0,1], axis=0)

Unnamed: 0,name
2,Casdas


In [13]:
# help(df.drop)

# `pandas.DataFrame.describe`

`.describe()` gives you quick summary statistics about your data. It's a first look at the shape and nature of your dataset, especially useful for spotting:
- Ranges of values
- Missing data (indirectly, through a `count` that is lower than the number of rows)
- Outliers
- Distribution tendencies (mean, median, etc.)
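
As a minimal sketch of how missing data shows up indirectly, the cell below compares the `count` row with the number of rows. The `ages` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one missing age out of four rows
ages = pd.DataFrame({'Age': [25, None, 30, 35]})

summary = ages.describe()

# `count` only counts non-null entries, so a count below the
# number of rows means some values are missing
n_rows = len(ages)
n_non_null = int(summary.loc['count', 'Age'])
n_missing = n_rows - n_non_null
print(n_rows, n_non_null, n_missing)
```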

For numerical columns, it gives:
- `count`: how many non-null entries
- `mean`: the average
- `std`: standard deviation (spread of data)
- `min`: minimum value
- `25%`, `50%`, `75%`: the 25th, 50th, and 75th percentiles (the 50th percentile is the median)
- `max`: maximum value
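
A small sketch of the numeric summary, using a made-up `people` DataFrame. Called with no arguments on a DataFrame that has numeric columns, `describe()` summarizes only those columns, so `Name` is left out here.

```python
import pandas as pd

# Hypothetical sample data with one text column and one numeric column
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
})

# With no arguments, only the numeric column ('Age') is summarized
numeric_summary = people.describe()
print(numeric_summary)

# The 50% row is the median of the column
median_age = numeric_summary.loc['50%', 'Age']
print(median_age == people['Age'].median())
```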

For categorical (object/string) columns, it gives:
- `count`: number of non-null values
- `unique`: number of distinct categories
- `top`: most common value
- `freq`: how often the top value appears
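
If you want both kinds of statistics in one table, one option is `include='all'`, which these notes do not cover elsewhere. The sketch below uses a made-up `people` DataFrame. Statistics that do not apply to a column come back as `NaN`.

```python
import pandas as pd

# Hypothetical sample data mixing numeric and text columns
people = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Gender': ['Female', 'Male', 'Male', 'Male'],
})

# include='all' summarizes every column, whatever its dtype
all_summary = people.describe(include='all')
print(all_summary)

# 'mean' is undefined for the text column, so it is NaN there
print(pd.isna(all_summary.loc['mean', 'Gender']))
```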

When you call `.describe()` on a single column (a Series), the `include` and `exclude` parameters are ignored; they only have an effect when `.describe()` is called on a DataFrame.

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Gender': ['Female', 'Male', 'Male', 'Male']
}

df = pd.DataFrame(data)

The cell below works as expected because `describe()` is called on the whole DataFrame, so `include` is respected:

In [5]:
df.describe(include='object')

Unnamed: 0,Name,Gender
count,4,4
unique,4,2
top,Alice,Male
freq,1,3


The `include` parameter is ignored here because `describe()` is called on a single column (a Series):

In [6]:
df['Gender'].describe(include='object')

count        4
unique       2
top       Male
freq         3
Name: Gender, dtype: object

# `pandas.DataFrame.dropna`
- *`data_df = data_df.dropna()`; what does this code do? Does it drop rows where all values are NA, or does it drop rows where even just one column has an NA value?*

The `dropna()` method in pandas drops rows (or columns, with `axis=1`) that contain missing values (`NaN` or `None`). By default it uses `how='any'`, so a row is dropped if any of its values is NA. A single NA value in one column is therefore enough for the entire row to be dropped. The method returns a new DataFrame, which is why the result is assigned back to `data_df`.
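
A minimal sketch of the default behaviour, using a made-up `data_df`:

```python
import pandas as pd

# Hypothetical sample data: row 1 has one NA, row 2 has only NAs
data_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, None],
})

# Default how='any': a row is dropped if at least one value is missing
cleaned = data_df.dropna()
print(cleaned)

# The original DataFrame is left unchanged
print(len(data_df), len(cleaned))
```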

If you want to drop rows where all values are NA, you can use the `how='all'` parameter:
- `data_df = data_df.dropna(how='all')`
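
A short sketch of `how='all'` on made-up data. The row with a single NA survives, and only the row that is entirely NA is removed.

```python
import pandas as pd

# Hypothetical sample data: row 1 has one NA, row 2 has only NAs
data_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, None],
})

# how='all': drop a row only when every value in it is missing
kept = data_df.dropna(how='all')
print(kept)
```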

You can also specify a subset of columns to consider when dropping rows:
- `data_df = data_df.dropna(subset=['column1', 'column2'])`
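
A sketch of `subset` on made-up data. The column names follow the placeholders above, and `column3` is added only to show that NAs outside the subset are not considered.

```python
import pandas as pd

# Hypothetical sample data; column3 is not part of the subset
data_df = pd.DataFrame({
    'column1': [1, None, 3],
    'column2': ['a', 'b', 'c'],
    'column3': [None, 2.0, None],
})

# Only NAs in column1 or column2 cause a row to be dropped
result = data_df.dropna(subset=['column1', 'column2'])
print(result)
```

Rows 0 and 2 are kept even though `column3` is NA there, while row 1 is dropped because `column1` is NA.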