# <u>Missing Data.

Often, the data we read in with pandas has missing points. When that happens, pandas automatically fills each missing point with a null value (NaN).
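As a small illustration of that behaviour, the sketch below reads a CSV with one empty field. The CSV text and the column names (`name`, `score`) are made up for this example and are not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical CSV text: the row for "bob" has no value for 'score'.
csv_text = "name,score\nalice,90\nbob,\ncarol,75\n"

# read_csv accepts any file-like object, so StringIO stands in for a file here.
scores = pd.read_csv(io.StringIO(csv_text))
print(scores)

# The empty field was read as NaN, so isna() flags exactly one entry.
n_missing = int(scores["score"].isna().sum())
print(n_missing)
```

Because NaN is a floating-point value, the `score` column is read as floats rather than integers.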

Let's show a few convenient methods to deal with Missing Data in pandas:

In [2]:
import numpy as np
import pandas as pd

---

# <u>Creating a DataFrame with a dictionary.

In [5]:
# Keys ('A', 'B', 'C') are going to be treated as columns in our DataFrame.
# np.nan signifies a null value.
# Column A has 1 missing value, Column B has 2 missing values and Column C has no missing values.

d = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]}

In [6]:
df = pd.DataFrame(d)
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |


- <u>NOTE:

    - Here's what's happening step by step. Pandas turns the dictionary into a 2D table (matrix):
        - d = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]}
        - Keys ('A', 'B', 'C') → become column names.
        - Values (lists like [1, 2, np.nan]) → become the column data.
        - So mentally, it’s like we already have three vertical columns:
            - A: 1, 2, NaN
            - B: 5, NaN, NaN
            - C: 1, 2, 3
        - When we call pd.DataFrame(d), pandas uses the position of each element in its list to decide which row index it belongs to (position 0 → row 0, position 1 → row 1, and so on).
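The mapping described in this note can be checked directly. The following is a minimal sketch that rebuilds the same DataFrame and counts the missing values per column with `isna()`, a method not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

d = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
df = pd.DataFrame(d)

# Dictionary keys became the column labels; list positions became the row index.
print(list(df.columns), list(df.index))

# isna() gives a True/False table of the same shape; summing counts NaNs per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

The counts should match the comment in the cell above: one missing value in A, two in B, and none in C.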

---

# <u>dropna():

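    This method removes missing values from the DataFrame.
    Syntax: df.dropna(axis = 0, thresh = <int, optional>, inplace = False)
            where;
                    axis   = 0 to drop rows, 1 to drop columns.
                    thresh = Minimum number of non-NaN values a row or column must have to be kept (Threshold). If omitted, any row or column with at least one NaN is dropped.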

In [10]:
df.dropna()

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |


- <u>NOTE:

    - When we don't specify any arguments in df.dropna(), it uses the default values.
    - The defaults are axis = 0 and inplace = False, with no thresh set.
    - So it drops every row with one or more missing values (NaN values) and returns a new DataFrame.
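Since `inplace` defaults to `False`, the original DataFrame is left untouched. A minimal sketch of the difference, using a copy (`df_copy`, a name introduced here for illustration) so the original stays available:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Default call: returns a new DataFrame; df itself still has all 3 rows.
cleaned = df.dropna()
print(len(df), len(cleaned))

# inplace = True modifies the object it is called on instead.
df_copy = df.copy()
df_copy.dropna(inplace=True)
print(len(df_copy))
```

Assigning the result (`df = df.dropna()`) is a common alternative to `inplace = True`.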

In [12]:
# Drop columns (axis = 1) with one or more missing values.

df.dropna(axis = 1)

|   | C |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


#### We can also specify a Threshold.

Keep only the rows that have at least 2 non-NaN values:

In [15]:
df.dropna(thresh = 2)

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
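One way to read `thresh` is as a filter on the number of non-NaN values per row. The sketch below compares it with an explicit `count()` filter and also applies it column-wise. The variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Rows with at least 2 non-NaN values are kept.
by_thresh = df.dropna(thresh=2)

# The same selection written out: count non-NaN values per row, then filter.
by_count = df[df.count(axis=1) >= 2]
print(by_thresh.equals(by_count))

# With axis = 1 the threshold applies to columns: B has only 1 non-NaN value.
cols_kept = df.dropna(axis=1, thresh=2)
print(list(cols_kept.columns))
```

Row 2 is dropped because it has only one non-NaN value (in column C).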


---

# <u>fillna():

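    Fills NA/NaN values in the DataFrame with a specified value.
    Syntax: df.fillna(value = 'FILL VALUE', axis = 0, inplace = False)
    Note: older pandas versions also accept a method argument ('ffill' or 'bfill'); it is deprecated in recent versions in favour of df.ffill() and df.bfill().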

In [18]:
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |


In [19]:
# Fills all NaN values with the string 'FILL VALUE'.

df.fillna(value = 'FILL VALUE')

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | FILL VALUE | 2 |
| 2 | FILL VALUE | FILL VALUE | 3 |
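A single fill value is applied to every column. When different columns need different replacements, `value` can also be a dictionary keyed by column name. The fill values `0` and `-1` below are arbitrary placeholders chosen for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Each key names a column; its value replaces the NaNs in that column only.
filled = df.fillna(value={'A': 0, 'B': -1})
print(filled)
```

Columns that are not listed in the dictionary, such as C here, are left as they are.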


#### Example: Fill the NaN values in column A with the mean.

In [21]:
df['A']

0    1.0
1    2.0
2    NaN
Name: A, dtype: float64

In [22]:
df['A'].fillna(value = df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
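The call above returns a new Series, so `df['A']` itself still contains NaN afterwards. A minimal sketch of two common follow-ups: assigning the result back, and filling every column with its own mean by passing `df.mean()` as the value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Assign the result back to keep the filled column in df.
df['A'] = df['A'].fillna(value=df['A'].mean())
print(df['A'].tolist())

# df.mean() is a Series indexed by column name, so each column gets its own mean.
filled = df.fillna(value=df.mean())
print(filled)
```

By default, `mean()` skips NaN values, which is why column B (with a single known value of 5) is filled with 5.0.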

In [38]:
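# Another example: select row 1 with .loc and fill its NaN values.
# inplace = False returns a new Series and leaves df unchanged.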

df.loc[1, :].fillna(value = 'SomeValue', inplace = False)

A          2.0
B    SomeValue
C          2.0
Name: 1, dtype: object

- <u>NOTE:

    - Choosing how to fill in missing values appropriately is a topic of its own, with a range of statistical methods behind it.
    - The right choice depends on the kind of data we're working with.
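To give a feel for how the choice of strategy changes the result, here is a small sketch on a made-up Series (`s` is not part of the notebook's data). It compares forward fill, backward fill, and linear interpolation. These are only three of many options, and which one is appropriate is an assumption to check against the data at hand.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered measurements with a gap in the middle.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()        # carry the last known value forward
backward = s.bfill()       # pull the next known value backward
linear = s.interpolate()   # straight line between the known neighbours

print(forward.tolist())
print(backward.tolist())
print(linear.tolist())
```

Forward and backward fill repeat an observed value, while interpolation estimates the values in between. This mostly makes sense for ordered data such as time series.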

---