In [1]:
import pandas as pd


# Handling Missing Data in Pandas

## 1. What is Missing Data?
- **Theory:**
  - Missing data (NaN - Not a Number) occurs when no data value is stored for a variable in an observation.
  - Common in real-world datasets due to data collection issues, human errors, or incomplete records.
  - Pandas represents missing values as `NaN` in numeric columns, `None` or `NaN` in object columns, and `NaT` in datetime columns.
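
As a quick illustration of these representations, the sketch below builds a small hypothetical frame (the values are made up; only the column names echo the weather data used later) and counts the missing entries per column with `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the weather CSV itself
toy = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0],                        # NaN in a float column
    "event": ["Rain", None, "Snow"],                            # None in a text column
    "day": pd.to_datetime(["2017-01-01", None, "2017-01-05"]),  # NaT in a datetime column
})

# isna() treats NaN, None, and NaT alike
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Counting missing values per column like this is a reasonable first step before choosing any of the strategies in the later sections.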

---

## 2. Loading Data with Missing Values

- **Theory:**
  - When reading CSV files, pandas by default converts empty fields and common placeholder strings such as `NA`, `NaN`, and `NULL` to NaN.
  - The `parse_dates` parameter converts the listed columns to datetime objects.
  - `set_index()` sets a column as the DataFrame index.

- **Example & Implementation:**
    ```python
    import pandas as pd
    
    # Load CSV with date parsing and set date column as index
    df = pd.read_csv("weather_data (2).csv", parse_dates=['day'])
    df.set_index('day', inplace=True)
    df
    ```
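
To see the automatic detection without the CSV file at hand, the sketch below reads a small inline CSV instead. The text in `csv_text` is a hypothetical stand-in for the weather file, and `index_col` is used here as a one-step alternative to calling `set_index()` afterwards.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for the weather file
csv_text = """day,temperature,windspeed,event
2017-01-01,32,6,Rain
2017-01-04,,9,Sunny
2017-01-05,28,NA,Snow
"""

# Empty fields and the string "NA" are both read as NaN by default
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"], index_col="day")
print(sample)
print(sample.isna().sum())
```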

---

## 3. Filling Missing Values with `fillna()`

### a) Fill with Single Value
- **Theory:**
  - `fillna()` replaces NaN values with specified values.
  - Can fill all columns with the same value or different values per column.

- **Example & Implementation:**
    ```python
    # Fill all NaN values with 0
    new_df = df.fillna(0)
    
    # Fill different columns with different values
    new_df = df.fillna({
        'temperature': 0,
        'windspeed': 0,
        'event': 'no Event'
    })
    new_df
    ```
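
Two details of the dictionary form are easy to miss: columns that are not listed keep their NaN values, and `fillna()` returns a new object rather than changing `df`. The sketch below shows both on a hypothetical two-row frame.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one gap per column
toy = pd.DataFrame({
    "temperature": [32.0, np.nan],
    "windspeed": [np.nan, 9.0],
    "event": ["Rain", None],
})

# 'windspeed' is deliberately left out of the mapping
partial = toy.fillna({"temperature": 0, "event": "no Event"})

print(partial)      # windspeed still has its NaN
print(toy.isna())   # the original frame is unchanged
```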

### b) Forward Fill and Backward Fill
- **Theory:**
  - `ffill` (forward fill): Uses the last valid observation to fill gaps.
  - `bfill` (backward fill): Uses the next valid observation to fill gaps.
  - Can specify `axis` for row-wise or column-wise filling.

- **Example & Implementation:**
    ```python
    # Forward fill: propagate the last valid observation down each column.
    # (fillna(method=...) is deprecated since pandas 2.1; use ffill()/bfill().)
    new_df = df.ffill()
    
    # Backward fill: use the next valid observation
    new_df = df.bfill()
    
    # Forward fill across columns, i.e. left to right within each row
    new_df = df.ffill(axis='columns')
    
    # Backward fill along the index, i.e. within each column (the default axis)
    new_df = df.bfill(axis='index')
    ```
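
The direction of each fill is easiest to see on a short series with gaps at both ends. The values below are hypothetical. Note that a forward fill cannot fill a leading gap and a backward fill cannot fill a trailing one.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an interior gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # leading NaN stays: there is no earlier value to copy
backward = s.bfill()   # trailing NaN stays: there is no later value to copy

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward}))
```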

### c) Limit the Number of Fills
- **Theory:**
  - `limit` parameter restricts how many consecutive NaN values to fill.
  - Useful to prevent over-filling in datasets with large gaps.

- **Example & Implementation:**
    ```python
    # Forward fill, but fill at most 1 consecutive NaN per gap
    # (fillna(method='ffill', limit=1) is deprecated since pandas 2.1)
    new_df = df.ffill(limit=1)
    new_df
    ```
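
A minimal sketch of how `limit` behaves on a gap of three consecutive NaN values (the series is hypothetical): only the first `limit` entries of the gap are filled and the rest stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of three consecutive NaN values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

one_step = s.ffill(limit=1)   # fills only the first NaN of the gap
two_steps = s.ffill(limit=2)  # fills the first two

print(pd.DataFrame({"original": s, "limit=1": one_step, "limit=2": two_steps}))
```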

---

## 4. Interpolation

- **Theory:**
  - Interpolation estimates missing values based on other data points.
  - Linear interpolation draws a straight line between known points.
  - Time interpolation considers the time intervals between data points.

- **Example & Implementation:**
    ```python
    # Linear interpolation
    new_interpolated = df.interpolate()
    
    # Time-based interpolation (for time series data)
    new_interpolated = df.interpolate(method='time')
    new_interpolated
    ```
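
The difference between the two methods shows up when the index is unevenly spaced. The sketch below uses three temperature readings taken from the weather data (2017-01-01, 2017-01-04, 2017-01-05). Linear interpolation treats the rows as equally spaced, while time interpolation weights by the actual number of days between them.

```python
import numpy as np
import pandas as pd

# Three rows from the weather data: a 3-day step followed by a 1-day step
idx = pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"])
temps = pd.Series([32.0, np.nan, 28.0], index=idx, name="temperature")

linear = temps.interpolate()                # midpoint of 32 and 28
by_time = temps.interpolate(method="time")  # 3/4 of the way from 32 to 28

print(linear)
print(by_time)
```

Here the linear estimate for 2017-01-04 is 30, while the time-based estimate is 29 because that date is three of the four days along the way from 32 to 28.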

---

## 5. Dropping Missing Values with `dropna()`

- **Theory:**
  - `dropna()` removes rows or columns containing NaN values.
  - `how='all'`: Drop only if ALL values in row/column are NaN.
  - `thresh=n`: Keep rows with at least n non-NaN values.

- **Example & Implementation:**
    ```python
    # Drop any row with at least one NaN value
    drop_na = df.dropna()
    
    # Drop only rows where ALL values are NaN
    drop_na = df.dropna(how='all')
    
    # Keep rows with at least 1 non-NaN value
    drop_na = df.dropna(thresh=1)
    drop_na
    ```
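
To compare how aggressive each option is, the sketch below rebuilds the weather data inline (values copied from the notebook output, so the CSV file is not needed) and counts the rows that survive each call. The `thresh=2` case is an added illustration, not one of the calls used in these notes.

```python
import numpy as np
import pandas as pd

# Weather data reconstructed from the notebook output
days = pd.to_datetime([
    "2017-01-01", "2017-01-04", "2017-01-05", "2017-01-06", "2017-01-07",
    "2017-01-08", "2017-01-09", "2017-01-10", "2017-01-11",
])
weather = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspeed": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
    "event": ["Rain", "Sunny", "Snow", None, "Rain", "Sunny", None, "Cloudy", "Sunny"],
}, index=days)

rows_kept = {
    "original": len(weather),
    "dropna()": len(weather.dropna()),                      # only fully populated rows
    "dropna(how='all')": len(weather.dropna(how="all")),    # drops the all-NaN row
    "dropna(thresh=2)": len(weather.dropna(thresh=2)),      # needs 2+ non-NaN values
}
print(rows_kept)
```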

---

## 6. Reindexing for Missing Dates

- **Theory:**
  - `reindex()` creates a new DataFrame with a specified index.
  - Useful for time series to ensure all dates are present.
  - Missing dates will be filled with NaN values.

- **Example & Implementation:**
    ```python
    # Create a complete daily date range (date_range already returns a DatetimeIndex)
    idx = pd.date_range("2017-01-01", "2017-01-11")
    
    # Reindex DataFrame to include all dates
    df_reindexed = df.reindex(idx)
    df_reindexed
    ```
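
Reindexing is often followed by one of the fill strategies from the earlier sections. The sketch below uses a hypothetical three-row slice (the 29.0 reading is made up for illustration), lists the dates that reindexing adds, and then fills them with time-based interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical slice: 2017-01-02 and 2017-01-03 have no rows at all
toy = pd.DataFrame(
    {"temperature": [32.0, 29.0, 28.0]},
    index=pd.to_datetime(["2017-01-01", "2017-01-04", "2017-01-05"]),
)

full_range = pd.date_range("2017-01-01", "2017-01-05")
added_dates = full_range.difference(toy.index)   # dates that reindexing will add

reindexed = toy.reindex(full_range)              # new rows are all NaN
filled = reindexed.interpolate(method="time")    # one possible follow-up fill

print(added_dates)
print(filled)
```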

---

## 7. Summary of Missing Data Handling Strategies

| Method | Use Case | Pros | Cons |
|--------|----------|------|------|
| `fillna(value)` | Replace with constant | Simple, fast | May introduce bias |
| `ffill()` | Time series gaps | Carries the last known value forward | Can propagate stale or erroneous values |
| `bfill()` | Time series gaps | Fills gaps at the start of a series | Uses future data, which may not be realistic |
| `interpolate()` | Numeric data | Smooth estimates | Assumes a linear relationship by default |
| `dropna()` | Small amount of missing data | Removes uncertainty | Loses information |
| `reindex()` | Time series with gaps | Ensures complete timeline | Creates more NaN values |

---

## 8. Best Practices

1. **Understand your data**: Investigate why data is missing before choosing a strategy.
2. **Domain knowledge**: Use subject matter expertise to choose appropriate fill values.
3. **Test different methods**: Compare results from different approaches.
4. **Document decisions**: Keep track of how you handled missing data for reproducibility.
5. **Validate results**: Check if your missing data strategy makes sense in context.
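
One lightweight way to act on point 3 is to put the candidate fills side by side before committing to one. The sketch below does this for the windspeed gap between 2017-01-06 and 2017-01-10 from the weather data. The column labels in `comparison` are arbitrary names chosen for this example.

```python
import numpy as np
import pandas as pd

# Windspeed readings from the weather data, with a three-day gap
idx = pd.date_range("2017-01-06", "2017-01-10")
wind = pd.Series([7.0, np.nan, np.nan, np.nan, 8.0], index=idx, name="windspeed")

comparison = pd.DataFrame({
    "original": wind,
    "ffill": wind.ffill(),
    "bfill": wind.bfill(),
    "interpolate": wind.interpolate(),
})
print(comparison)
```

Seeing the three estimates next to each other makes it easier to judge which one is plausible for the quantity being measured.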

---


In [4]:
df=pd.read_csv("weather_data (2).csv", parse_dates=['day'])
df.set_index('day', inplace=True)
df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [16]:
# new_df = df.fillna(0)
# new_df
new_df = df.fillna({
    'temperature': 0,
    'windspeed': 0,
    'event': 'no Event'
})
new_df

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,0.0,9.0,Sunny
2017-01-05,28.0,0.0,Snow
2017-01-06,0.0,7.0,no Event
2017-01-07,32.0,0.0,Rain
2017-01-08,0.0,0.0,Sunny
2017-01-09,0.0,0.0,no Event
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [15]:
# fillna(method=...) is deprecated since pandas 2.1; ffill()/bfill() give the same result
new_df = df.ffill()
new_df

new_df = df.bfill()
new_df

new_df = df.ffill(axis='columns')
new_df

new_df = df.bfill(axis='index')
new_df


day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,28.0,9.0,Sunny
2017-01-05,28.0,7.0,Snow
2017-01-06,32.0,7.0,Rain
2017-01-07,32.0,8.0,Rain
2017-01-08,34.0,8.0,Sunny
2017-01-09,34.0,8.0,Cloudy
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [17]:
# fillna(method='ffill', limit=1) is deprecated since pandas 2.1
new_df = df.ffill(limit=1)
new_df


day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,32.0,9.0,Sunny
2017-01-05,28.0,9.0,Snow
2017-01-06,28.0,7.0,Snow
2017-01-07,32.0,7.0,Rain
2017-01-08,32.0,,Sunny
2017-01-09,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [19]:
new_interpolated = df.interpolate()
new_interpolated

new_interpolated = df.interpolate(method='time')
new_interpolated



day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,29.0,9.0,Sunny
2017-01-05,28.0,8.0,Snow
2017-01-06,30.0,7.0,
2017-01-07,32.0,7.25,Rain
2017-01-08,32.666667,7.5,Sunny
2017-01-09,33.333333,7.75,
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [24]:
drop_na = df.dropna()
drop_na

drop_na = df.dropna(how='all')
drop_na

drop_na = df.dropna(thresh=1)
drop_na

day,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-10,34.0,8.0,Cloudy
2017-01-11,40.0,12.0,Sunny


In [25]:
# date_range already returns a DatetimeIndex; ISO dates avoid day/month ambiguity
idx = pd.date_range("2017-01-01", "2017-01-11")
df_reindexed = df.reindex(idx)
df_reindexed

,temperature,windspeed,event
2017-01-01,32.0,6.0,Rain
2017-01-02,,,
2017-01-03,,,
2017-01-04,,9.0,Sunny
2017-01-05,28.0,,Snow
2017-01-06,,7.0,
2017-01-07,32.0,,Rain
2017-01-08,,,Sunny
2017-01-09,,,
2017-01-10,34.0,8.0,Cloudy
