Handling Missing Data in Pandas (Simplified):

Real-world data often has missing or incomplete values. There are two general strategies for marking them:

Masking: A separate Boolean array marks values as missing (costs extra memory for the mask).

Sentinel Values: A special value (like NaN or None) marks missing entries (needs no extra array, but takes one value out of the range of valid data).

Pandas is built on NumPy, which has no built-in notion of a missing value for non-floating-point data types. To keep things efficient, Pandas uses sentinels:

NaN for missing numeric data.

None for missing object data.

This approach balances performance and flexibility, even though it's not perfect for every case.
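The two strategies can be compared side by side in plain NumPy. This is only a minimal sketch with made-up values; `np.ma.masked_array` stands in for the masking approach and `np.nan` for the sentinel approach.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]  # made-up data; the second entry is treated as missing

# Masking: keep the data intact and flag the missing slot in a separate Boolean array
masked = np.ma.masked_array(values, mask=[False, True, False, False])
masked_sum = masked.sum()  # the masked entry is skipped

# Sentinel: overwrite the missing slot with a special value (NaN)
sentinel = np.array([1.0, np.nan, 3.0, 4.0])
sentinel_sum = np.nansum(sentinel)  # the nan-aware sum skips the sentinel

print(masked_sum, sentinel_sum)
```

Both sums skip the missing entry; the difference is where the "missing" information lives (a second array versus the data itself).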

In [1]:
import numpy as np
import pandas as pd

# Creating a NumPy array with a None value
vals1 = np.array([1, None, 3, 4])

# None is not a number, so NumPy uses 'object' dtype to store this mixed-type array
# This means all elements are treated as general Python objects
print(vals1)
# Output: [1 None 3 4]   (vals1.dtype is object)


# Comparing the performance of sum operation on 'object' dtype vs. 'int' dtype
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    # %timeit is an IPython magic command to measure execution time
    # We use np.arange to create a large array and then sum it
    # For 'object' dtype, operations are slower because they're done in Python space
    # For 'int', operations are faster because NumPy uses optimized C code
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()



# Trying to perform sum() on an array with None value
# This will raise a TypeError because Python can't add an int and None
vals1.sum()  

# Error:
# TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'



[1 None 3 4]
dtype = object
48.2 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
3.18 ms ± 203 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)



TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

🔑 Key Takeaways:
None in an array forces NumPy to use the object dtype, which slows operations down (about 15x in the timing above: 48.2 ms vs 3.18 ms).

Arithmetic with None raises errors since Python doesn't define what 1 + None should be.

It's better to use np.nan for missing values in numerical arrays, as it's compatible with NumPy's operations.
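As a small illustration of the last point, a list containing None can be converted before it reaches NumPy. The input list `raw` below is hypothetical, and the list comprehension is just one simple way to do the swap.

```python
import numpy as np

raw = [1, None, 3, 4]  # hypothetical input with a missing entry

# Swap None for np.nan so NumPy can build a float64 array instead of an object array
clean = np.array([np.nan if v is None else v for v in raw], dtype=float)

print(clean.dtype)
print(np.nansum(clean))
```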

In [2]:
import numpy as np

# Creating a NumPy array with a missing value using np.nan
vals2 = np.array([1, np.nan, 3, 4])

# NaN is a special floating-point value, so NumPy assigns float64 dtype
print(vals2.dtype)  # Output: float64

# NaN behaves like a "data virus" – any math with NaN results in NaN
print(1 + np.nan)   # Output: nan
print(0 * np.nan)   # Output: nan



# Aggregations like sum(), min(), max() on data with NaN return NaN
# Because the NaN makes the result "unknown"
print(vals2.sum(), vals2.min(), vals2.max())  # Output: nan nan nan



# NumPy provides nan-safe versions that ignore NaN during calculations
print(np.nansum(vals2))   # Output: 8.0 (1 + 3 + 4)
print(np.nanmin(vals2))   # Output: 1.0
print(np.nanmax(vals2))   # Output: 4.0



float64
nan
nan
nan nan nan
8.0
1.0
4.0


🔑 Summary:
np.nan is used for missing float values.

It allows NumPy to keep using fast, compiled operations (unlike None).

Normal aggregations with NaN return nan.

Use np.nansum, np.nanmin, np.nanmax to ignore NaN safely.

NaN is a floating-point value, so it only fits in float arrays: an integer array cannot hold it without being converted to float.
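One more property is worth seeing in code: NaN never compares equal to anything, so `==` cannot be used to find it. The sketch below uses `np.isnan` for detection; the variable names are illustrative only.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN never compares equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)

# np.isnan() is the reliable way to locate NaN entries
missing_mask = np.isnan(vals)
n_missing = int(missing_mask.sum())

# Boolean indexing keeps only the valid entries
valid = vals[~missing_mask]

print(nan_equals_itself)
print(missing_mask)
print(n_missing, valid)
```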

In [None]:
import pandas as pd
import numpy as np

# Pandas converts both None and np.nan to NaN in float dtype
s = pd.Series([1, np.nan, 2, None])
print(s)
# Output:
# 0    1.0
# 1    NaN
# 2    2.0
# 3    NaN
# dtype: float64



# Creating an integer Series
x = pd.Series(range(2), dtype=int)
print(x)
# Output:
# 0    0
# 1    1
# dtype: int64

# Assigning None converts int to float and stores NaN
x[0] = None
print(x)
# Output:
# 0    NaN
# 1    1.0
# dtype: float64



| Original dtype | Conversion when NA is stored | Resulting dtype | NA sentinel         |
| -------------- | ---------------------------- | --------------- | ------------------- |
| float          | No change                    | float64         | `np.nan`            |
| object         | No change                    | object          | `None` or `np.nan`  |
| integer        | Cast to float64              | float64         | `np.nan`            |
| boolean        | Cast to object               | object          | `None` or `np.nan`  |



Key Points:
None and np.nan are treated the same for most Pandas operations.

Integer and boolean types are upcast to hold missing values.

Strings are stored as object, so both None and np.nan work fine.
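The upcasting rules can be checked directly by building small Series that contain None. This is a quick sketch with made-up values; the object Series is given `dtype=object` explicitly so the result does not depend on how a particular pandas version infers string data.

```python
import numpy as np
import pandas as pd

float_s = pd.Series([1.5, None])         # float stays float64, None becomes NaN
int_s = pd.Series([1, 2, None])          # integers are upcast to float64
bool_s = pd.Series([True, False, None])  # booleans are upcast to object
obj_s = pd.Series(["a", None, np.nan], dtype=object)  # object holds both sentinels

print(float_s.dtype, int_s.dtype, bool_s.dtype, obj_s.dtype)

# None and np.nan are both reported as missing
print(obj_s.isna().tolist())
```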


| Method      | What it does                        |
| ----------- | ----------------------------------- |
| `isnull()`  | Detects missing values (NaN / None) |
| `notnull()` | Detects non-missing values          |
| `dropna()`  | Removes missing data                |
| `fillna()`  | Fills missing data with a value     |


In [26]:
import pandas as pd
import numpy as np

# Load dog-related data that contains some missing values
df = pd.read_csv('dog_data.csv')

print(df.notnull())
# Opposite of isnull(): True where the value is NOT missing
# Output: a Boolean DataFrame with the same shape as df

print("\nfile data is\n", df)

# Using dropna() to remove rows with any missing values
print("\n\n using dropna() to remove rows with missing values\n\n", df.dropna())
# dropna() can drop rows or columns that contain NaN values
# axis=0 (the default) drops rows; axis=1 drops columns
print("\n\n using dropna(axis=1) to remove columns with missing values\n\n", df.dropna(axis=1))


# dropna() takes further parameters:
#   how='all'                drop a row only if all of its values are NaN
#   thresh=2                 keep rows with at least 2 non-NaN values
#   subset=['col1', 'col2']  only consider these columns when dropping rows
print("\n\n using dropna(how='all') to remove rows with all values as NaN\n\n", df.dropna(how='all'))  # removes no rows here: every row has at least one value
print("\n\n using dropna(thresh=2) to keep rows with at least 2 non-NaN values\n\n", df.dropna(thresh=2))  # removes no rows here: every row has at least 3 non-NaN values
print("\n\n using dropna(subset=['Weight_kg', 'Strength']) to only consider specific columns for dropping rows\n\n", df.dropna(subset=['Weight_kg', 'Strength']))  # removes rows 2, 4, 5 and 7, which have NaN in one of these columns



# Using fillna() to replace missing values with a specific value
print("\n\n using fillna() to replace missing values with a specific value\n\n", df.fillna(0))
# What happens if a numeric column is filled with a string value?
print("\n\n using fillna() to replace missing values with a string value\n\n", df.fillna('Unknown'))
# The column then mixes numbers and strings, so numeric operations such as mean() and sum()
# will raise an error; be careful when filling numeric columns with strings


# Suppose we want NaN in Weight_kg replaced with 11 and NaN in Strength replaced with 12
print("\n\n using fillna() to replace missing values with specific values for each column\n\n", df.fillna({'Weight_kg': 11, 'Strength': 12}))


# To replace missing values with the previous value in the column, use ffill()
print("\n\n using df.ffill() to replace missing values with previous value in the column\n\n", df.ffill())

# To replace missing values with the next value in the column, use bfill()
print("\n\n using df.bfill() to replace missing values with next value in the column\n\n", df.bfill())
# If the last element of a column is NaN, bfill() cannot replace it because there is no next value, so it remains NaN



backup_df = df.copy()  # Make a backup before changes
df.fillna(0, inplace=True)  # Perform in-place modification

# To undo:
df = backup_df.copy()  # Restore from backup



#now we will see use of inplace parameter in fillna() method
# Using fillna() with inplace=True to modify the DataFrame directly
df.fillna(0, inplace=True)# here 0 is used to replace all NaN values in the DataFrame
print("\n\n using fillna() with inplace=True to modify the DataFrame directly\n\n", df)



# inplace=True modifies the actual DataFrame, so there is no need to assign the result back to df

#df.fillna(0, inplace=True)  # modifies DataFrame in memory

#df.to_csv('dogs.csv', index=False)  # saves the modified DataFrame to a CSV file


# fillna() also has other parameters. Some important ones are: axis=0 or axis=1 to choose the direction to fill along,
# limit=1 to limit the number of replacements, and inplace=True to modify the DataFrame directly.
# The older method='ffill' argument is deprecated since pandas 2.1; use ffill() / bfill() instead.



    Dog  Weight_kg  Height_cm  Strength
0  True       True       True      True
1  True       True       True      True
2  True      False       True      True
3  True       True      False      True
4  True       True       True     False
5  True      False       True      True
6  True       True      False      True
7  True       True       True     False

file data is
          Dog  Weight_kg  Height_cm  Strength
0    Bulldog       24.0       40.0      80.0
1   Labrador       30.0       55.0      90.0
2     Poodle        NaN       45.0      70.0
3     Beagle       10.0        NaN      60.0
4      Boxer       28.0       60.0       NaN
5  Dalmatian        NaN       58.0      85.0
6      Husky       27.0        NaN      88.0
7        Pug        8.0       25.0       NaN


 using dropna() to remove rows with missing values

         Dog  Weight_kg  Height_cm  Strength
0   Bulldog       24.0       40.0      80.0
1  Labrador       30.0       55.0      90.0


 using dropna(axis=1) to remov
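Because the printed output above is cut off, here is a self-contained sketch of the main `dropna()` and `fillna()` variants. It rebuilds the DataFrame inline from the values printed above instead of reading `dog_data.csv`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Values copied from the dog_data.csv printout above
df = pd.DataFrame({
    "Dog": ["Bulldog", "Labrador", "Poodle", "Beagle",
            "Boxer", "Dalmatian", "Husky", "Pug"],
    "Weight_kg": [24, 30, np.nan, 10, 28, np.nan, 27, 8],
    "Height_cm": [40, 55, 45, np.nan, 60, 58, np.nan, 25],
    "Strength": [80, 90, 70, 60, np.nan, 85, 88, np.nan],
})

# dropna variants (each returns a new DataFrame; df itself is unchanged)
no_nan_rows = df.dropna()                                # only rows 0 and 1 are complete
no_nan_cols = df.dropna(axis=1)                          # only 'Dog' has no NaN
by_subset = df.dropna(subset=["Weight_kg", "Strength"])  # drops rows 2, 4, 5 and 7
by_thresh = df.dropna(thresh=4)                          # requires all 4 values present

# fillna variants
per_column = df.fillna({"Weight_kg": 11, "Strength": 12})  # Height_cm keeps its NaN
forward = df.ffill()    # copy the previous valid value downwards
backward = df.bfill()   # copy the next valid value upwards; a trailing NaN stays

print(by_subset)
print(per_column)
print(backward)
```

Every call here returns a new DataFrame, so the original `df` stays available for comparison without needing a backup copy.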

In [48]:
df1 = pd.read_csv('dog_data.csv')
# replace() and interpolate() can also be used to handle missing values in a pandas DataFrame
# Using replace() to replace specific values in the DataFrame
# Here a nested dictionary is used: {column: {old_value: new_value}}
print("\n\n using replace() to replace specific values in the DataFrame\n\n", df1.replace({'Weight_kg': {24: 10}, 'Strength': {88: 10}}))

# Suppose we want to replace every letter (a to z and A to Z) with a fixed string
replaced = df1.replace("[a-zA-Z]", "dg", regex=True)
replaced = replaced.infer_objects(copy=False)
print("\n\n using replace() to replace all alphabets with 'dg'\n\n", replaced)


# Restricting a regex replacement to one column with a dictionary
# The pattern [A-Z] matches uppercase letters only. Because the replacement 22 is not a string,
# pandas replaces each whole matching entry with 22 instead of substituting letter by letter
rp = df1.replace({'Dog':'[A-Z]'}, 22, regex=True)
rp = rp.infer_objects(copy=False)
print("\n\n using replace() to replace entries in the Dog column that contain an uppercase letter with 22\n\n", rp)

# Replacing data with the previous value in the column can also be done using ffill()




 using replace() to replace specific values in the DataFrame

          Dog  Weight_kg  Height_cm  Strength
0    Bulldog       10.0       40.0      80.0
1   Labrador       30.0       55.0      90.0
2     Poodle        NaN       45.0      70.0
3     Beagle       10.0        NaN      60.0
4      Boxer       28.0       60.0       NaN
5  Dalmatian        NaN       58.0      85.0
6      Husky       27.0        NaN      10.0
7        Pug        8.0       25.0       NaN


 using replace() to replace all alphabets with 'dg'

                   Dog  Weight_kg  Height_cm  Strength
0      dgdgdgdgdgdgdg       24.0       40.0      80.0
1    dgdgdgdgdgdgdgdg       30.0       55.0      90.0
2        dgdgdgdgdgdg        NaN       45.0      70.0
3        dgdgdgdgdgdg       10.0        NaN      60.0
4          dgdgdgdgdg       28.0       60.0       NaN
5  dgdgdgdgdgdgdgdgdg        NaN       58.0      85.0
6          dgdgdgdgdg       27.0        NaN      88.0
7              dgdgdg        8.0       25.
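The two `replace()` forms used above can be tried on a tiny frame. This is a sketch with a hypothetical two-row DataFrame; it uses a string replacement (`"*"`) for the regex case, which substitutes each match inside the text.

```python
import pandas as pd

# Hypothetical two-row frame, small enough to check by eye
df = pd.DataFrame({"Dog": ["Bulldog", "Pug"], "Weight_kg": [24.0, 8.0]})

# Nested dict: {column: {old_value: new_value}}
by_value = df.replace({"Weight_kg": {24: 10}})

# Regex limited to one column; a string replacement substitutes each match
starred = df.replace({"Dog": "[A-Z]"}, "*", regex=True)

print(by_value)
print(starred)
```

Only the uppercase first letters match `[A-Z]`, so the rest of each name is left untouched.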



In [54]:
# 📌 .interpolate() — Fills missing values using interpolation
# Interpolation means estimating unknown values by using surrounding known values.

# ------------------------------------------------------
# 🧮 LINEAR INTERPOLATION
# ------------------------------------------------------

# 🔸 Here we fill missing numeric values by drawing straight lines between known values.

# Example:
# 'Weight_kg' for Poodle (index 2) was NaN.
# Labrador (index 1) = 30, Beagle (index 3) = 10 → interpolated as (30 + 10)/2 = 20 ✔️

# 'Height_cm' for Beagle (index 3) was NaN.
# Poodle (index 2) = 45, Boxer (index 4) = 60 → linear path → interpolated as 52.5 ✔️

# 'Strength' for Boxer (index 4) was NaN.
# Beagle (index 3) = 60, Dalmatian (index 5) = 85 → interpolated as (60 + 85)/2 = 72.5 ✔️

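# 'Strength' for Pug (index 7) was NaN.
# Husky (index 6) = 88 → no value *after* Pug, so there is nothing to draw a line to;
# with the default forward direction the last known value is carried over → filled as 88
df2 = pd.read_csv('dog_data.csv')
# Interpolate only the numeric columns; the text column 'Dog' cannot be interpolated
numeric_cols = df2.select_dtypes(include='number').columns
linear_interpolated = df2.copy()
linear_interpolated[numeric_cols] = df2[numeric_cols].interpolate(method='linear')  # straight-line estimation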
print("\n\n using interpolate(method='linear') to fill missing values using linear interpolation\n\n", linear_interpolated)



 using interpolate(method='linear') to fill missing values using linear interpolation

          Dog  Weight_kg  Height_cm  Strength
0    Bulldog       24.0       40.0      80.0
1   Labrador       30.0       55.0      90.0
2     Poodle       20.0       45.0      70.0
3     Beagle       10.0       52.5      60.0
4      Boxer       28.0       60.0      72.5
5  Dalmatian       27.5       58.0      85.0
6      Husky       27.0       41.5      88.0
7        Pug        8.0       25.0      88.0
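The edge behaviour seen in the last row (Pug's Strength filled with 88.0) is easier to see on a single Series. The sketch below uses made-up values with a NaN at both ends; `limit_area="inside"` is the `interpolate()` option that restricts filling to gaps surrounded by valid values.

```python
import numpy as np
import pandas as pd

# Made-up values: NaN at the start, in the middle, and at the end
s = pd.Series([np.nan, 30.0, np.nan, 10.0, np.nan])

# Default: the middle gap gets the midpoint, the trailing NaN repeats the last
# valid value, and the leading NaN is left alone
default_fill = s.interpolate(method="linear")

# limit_area="inside": only gaps between two valid values are filled
inside_only = s.interpolate(method="linear", limit_area="inside")

print(default_fill.tolist())
print(inside_only.tolist())
```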




In [53]:
# ------------------------------------------------------
# 📈 POLYNOMIAL INTERPOLATION
# ------------------------------------------------------

# ⚠️ Works on numeric data only (and needs SciPy installed), so drop the non-numeric 'Dog' column
numeric_df = df1.drop(columns='Dog')

# 🔸 Fits a curve instead of a straight line. order=2 → quadratic pieces: ax² + bx + c

# Example:
# 'Strength' at index 4 (Boxer) is NaN.
# The curve takes more than the two neighbours (index 3 = 60, index 5 = 85) into account,
# so the estimate is 70.23 (not the straight-line 72.5) ✔️

# 'Weight_kg' for Poodle (index 2) is NaN.
# Polynomial fit gives approx 16.99 (instead of 20) due to curve behavior ✔️

# 'Height_cm' for Beagle (index 3) → filled as 50.11 (not 52.5) due to curvature ✔️

# 'Strength' for Pug (index 7) is still NaN.
# No values after it → this method does not extrapolate past the last known value → stays NaN ❌

# 🚀 Apply interpolation and combine back the 'Dog' column
polynomial_interpolated = numeric_df.interpolate(method='polynomial', order=2)
final_polynomial_df = pd.concat([df1['Dog'], polynomial_interpolated], axis=1)
print("\n\n using interpolate(method='polynomial', order=2) to fill missing values using polynomial interpolation\n\n", final_polynomial_df)



 using interpolate(method='polynomial', order=2) to fill missing values using polynomial interpolation

          Dog  Weight_kg  Height_cm   Strength
0    Bulldog  24.000000  40.000000  80.000000
1   Labrador  30.000000  55.000000  90.000000
2     Poodle  16.991228  45.000000  70.000000
3     Beagle  10.000000  50.110957  60.000000
4      Boxer  28.000000  60.000000  70.231481
5  Dalmatian  34.675439  58.000000  85.000000
6      Husky  27.000000  46.216366  88.000000
7        Pug   8.000000  25.000000        NaN
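To see why a curve and a straight line give different estimates, the idea can be reproduced by hand with NumPy. This is a hand-rolled illustration with made-up points, not the SciPy-backed routine that `interpolate(method='polynomial')` calls; the three known points lie on y = x², and x = 2 plays the role of the missing row.

```python
import numpy as np

# Three known points on y = x**2; x = 2 is the missing position
x_known = np.array([0.0, 1.0, 3.0])
y_known = np.array([0.0, 1.0, 9.0])
x_missing = 2.0

# Linear: straight line between the two neighbours (1, 1) and (3, 9)
linear_estimate = np.interp(x_missing, x_known, y_known)

# Quadratic: one parabola through all three known points
coeffs = np.polyfit(x_known, y_known, deg=2)
quadratic_estimate = np.polyval(coeffs, x_missing)

print(linear_estimate)
print(round(float(quadratic_estimate), 6))
```

The straight line gives 5, while the parabola recovers a value of about 4 because it also uses the point at x = 0. The same effect explains why the polynomial results above differ from the linear ones.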
