# STATS507 Assignment 4
### **Kailan Xu**  
### *October 20, 2021*
***

## Question 0 - Topics in Pandas [25 points]

### *Working with missing data*

- Detecting missing data
- Inserting missing data
- Calculations with missing data
- Cleaning / filling missing data
- Dropping axis labels with missing data

### 1. Detecting missing data

Because data comes in many shapes and forms, pandas aims to be flexible in how it handles missing data. `NaN` is the default missing-value marker for reasons of computational speed and convenience, but missing values must be easy to detect in data of different types: floating point, integer, boolean, and general object. Python's `None` also arises in many cases and should likewise be treated as “missing”, “not available”, or “NA”.

In [2]:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randn(5, 3),
    index=["a", "c", "e", "f", "h"],
    columns=["one", "two", "three"],
)
df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
df2

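|   | one | two | three |
|---|---|---|---|
| a | 0.560463 | 0.264317 | -0.137484 |
| b | NaN | NaN | NaN |
| c | 0.324734 | 1.527323 | -0.457322 |
| d | NaN | NaN | NaN |
| e | -0.948104 | 2.252645 | 1.740101 |
| f | 0.250545 | -1.338996 | -0.284927 |
| g | NaN | NaN | NaN |
| h | 0.745512 | 0.078105 | 0.168835 |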


To make detecting missing values easier (and across different array dtypes), pandas provides the `isna()` and `notna()` functions, which are also methods on Series and DataFrame objects:

In [3]:
df2.isna()

|   | one | two | three |
|---|---|---|---|
| a | False | False | False |
| b | True | True | True |
| c | False | False | False |
| d | True | True | True |
| e | False | False | False |
| f | False | False | False |
| g | True | True | True |
| h | False | False | False |


In [4]:
df2.notna()

|   | one | two | three |
|---|---|---|---|
| a | True | True | True |
| b | False | False | False |
| c | True | True | True |
| d | False | False | False |
| e | True | True | True |
| f | True | True | True |
| g | False | False | False |
| h | True | True | True |
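
To illustrate the point that `None` should count as missing too, here is a minimal sketch that applies `isna()` to an object Series mixing `NaN`, `None`, and `NaT`, and then counts missing values per column of a reindexed frame. The names `mixed`, `frame`, and `missing_per_column` and all sample values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical object Series mixing real values with three kinds of "missing".
mixed = pd.Series([1.5, np.nan, None, "text", pd.NaT], dtype=object)
mixed_mask = mixed.isna()
print(mixed_mask.tolist())

# NaN never compares equal to itself, so `== np.nan` cannot find missing values.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)

# Reindexing with a label that does not exist ("b") introduces a row of NaN.
frame = pd.DataFrame(
    {"one": [1.0, 2.0], "two": [3.0, 4.0]},
    index=["a", "c"],
).reindex(["a", "b", "c"])

# isna() returns booleans, so summing them counts missing values per column.
missing_per_column = frame.isna().sum()
print(missing_per_column)
```

Because `True` counts as 1 when summed, `isna().sum()` is a quick way to see how many entries are missing in each column.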


### 2. Inserting missing data

You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based on the dtype.
For example, numeric containers will always use NaN regardless of the missing value type chosen:

In [5]:
s = pd.Series([1, 2, 3])
s.loc[0] = None
s

0    NaN
1    2.0
2    3.0
dtype: float64

Likewise, datetime containers will always use NaT.
For object containers, pandas will use the value given:

In [6]:
s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan
s

0    None
1     NaN
2       c
dtype: object
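
The dtype-dependent behaviour described in this section can be checked side by side. The sketch below assigns `None` into a float, a datetime, and an object Series; the variable names and sample dates are assumptions chosen only for illustration, and the object Series is created with an explicit `dtype=object` so the example does not depend on the default string dtype of the installed pandas version.

```python
import numpy as np
import pandas as pd

# Float container: None is stored as NaN.
float_s = pd.Series([1.0, 2.0, 3.0])
float_s.loc[0] = None

# Datetime container: None is stored as NaT.
dt_s = pd.Series(pd.to_datetime(["2021-10-20", "2021-10-21"]))
dt_s.loc[0] = None

# Object container: the value given is stored as-is.
obj_s = pd.Series(["a", "b", "c"], dtype=object)
obj_s.loc[0] = None
obj_s.loc[1] = np.nan

print(float_s)
print(dt_s)
print(obj_s)
```

Whatever marker ends up being stored, `isna()` reports all of these entries as missing.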

### 3. Calculations with missing data

- When summing data, NA (missing) values will be treated as zero.
- If the data are all NA, the result will be 0.
- Cumulative methods like `cumsum()` and `cumprod()` ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use `skipna=False`.

In [7]:
df2

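|   | one | two | three |
|---|---|---|---|
| a | 0.560463 | 0.264317 | -0.137484 |
| b | NaN | NaN | NaN |
| c | 0.324734 | 1.527323 | -0.457322 |
| d | NaN | NaN | NaN |
| e | -0.948104 | 2.252645 | 1.740101 |
| f | 0.250545 | -1.338996 | -0.284927 |
| g | NaN | NaN | NaN |
| h | 0.745512 | 0.078105 | 0.168835 |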


In [8]:
df2["one"].sum()

0.9331500980030866

In [9]:
df2.mean(axis=1)  # row-wise mean; NA values are skipped

a    0.229099
b         NaN
c    0.464912
d         NaN
e    1.014881
f   -0.457793
g         NaN
h    0.330817
dtype: float64

In [10]:
df2.cumsum()

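|   | one | two | three |
|---|---|---|---|
| a | 0.560463 | 0.264317 | -0.137484 |
| b | NaN | NaN | NaN |
| c | 0.885197 | 1.79164 | -0.594806 |
| d | NaN | NaN | NaN |
| e | -0.062907 | 4.044285 | 1.145296 |
| f | 0.187638 | 2.70529 | 0.860368 |
| g | NaN | NaN | NaN |
| h | 0.93315 | 2.783395 | 1.029204 |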


In [11]:
df2.cumsum(skipna=False)

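|   | one | two | three |
|---|---|---|---|
| a | 0.560463 | 0.264317 | -0.137484 |
| b | NaN | NaN | NaN |
| c | NaN | NaN | NaN |
| d | NaN | NaN | NaN |
| e | NaN | NaN | NaN |
| f | NaN | NaN | NaN |
| g | NaN | NaN | NaN |
| h | NaN | NaN | NaN |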
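
The rules listed at the start of this section can be verified on a small, deterministic example. This is a sketch with made-up values; `min_count` is a parameter of pandas' `sum()` that the text above does not discuss, shown here only as one way to get `NaN` instead of 0 for all-NA data.

```python
import numpy as np
import pandas as pd

# All-NA data: the sum is 0 and the product is 1.
all_na = pd.Series([np.nan, np.nan])
print(all_na.sum())
print(all_na.prod())

# min_count=1 requires at least one non-NA value, otherwise the result is NaN.
print(all_na.sum(min_count=1))

# Partially missing data: NA values are skipped by sum() and mean().
partial = pd.Series([1.0, np.nan, 2.0])
print(partial.sum())
print(partial.mean())

# cumsum() skips NA in the running total but keeps it in the output...
running = partial.cumsum()
# ...while skipna=False makes every value after the first NA missing.
running_strict = partial.cumsum(skipna=False)
print(running)
print(running_strict)
```

Note that `mean()` divides by the number of non-NA values, so the mean of `partial` is computed over two values, not three.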


### 4. Cleaning / filling missing data

pandas objects are equipped with various data manipulation methods for dealing with missing data.
`fillna()` replaces NA values with non-NA data. The cells below fill with a scalar: first a number, then a string (which changes the column's dtype to `object`):

In [12]:
df2.fillna(0)

|   | one | two | three |
|---|---|---|---|
| a | 0.560463 | 0.264317 | -0.137484 |
| b | 0.0 | 0.0 | 0.0 |
| c | 0.324734 | 1.527323 | -0.457322 |
| d | 0.0 | 0.0 | 0.0 |
| e | -0.948104 | 2.252645 | 1.740101 |
| f | 0.250545 | -1.338996 | -0.284927 |
| g | 0.0 | 0.0 | 0.0 |
| h | 0.745512 | 0.078105 | 0.168835 |


In [13]:
df2["one"].fillna("missing")

a    0.560463
b     missing
c    0.324734
d     missing
e   -0.948104
f    0.250545
g     missing
h    0.745512
Name: one, dtype: object
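
Beyond a single scalar, `fillna()` also accepts per-column values, and pandas provides `ffill()` to propagate the last valid observation forward. The sketch below shows three such strategies on a small frame; the frame, its values, and the fill values are assumptions made for illustration, not part of the assignment data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 5.0, 7.0]},
    index=["a", "b", "c"],
)

# A dict maps column names to the fill value used for that column.
by_column = frame.fillna({"one": 0.0, "two": -1.0})

# A Series indexed by column name works the same way; here each column
# is filled with its own mean (computed over the non-NA values).
by_mean = frame.fillna(frame.mean())

# Forward fill: a leading NA has nothing before it and stays missing.
forward = frame.ffill()

print(by_column)
print(by_mean)
print(forward)
```

Filling with a constant is simple but can bias summaries such as the mean, while forward filling mainly makes sense when the row order is meaningful, for example in time series.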

### 5. Dropping axis labels with missing data

You may wish to simply exclude labels from a data set that refer to missing data. To do this, use `dropna()`; with `axis=0` (the default) it drops every row that contains at least one NA value:

In [14]:
df2.dropna(axis=0)

|   | one | two | three |
|---|---|---|---|
| a | 0.560463 | 0.264317 | -0.137484 |
| c | 0.324734 | 1.527323 | -0.457322 |
| e | -0.948104 | 2.252645 | 1.740101 |
| f | 0.250545 | -1.338996 | -0.284927 |
| h | 0.745512 | 0.078105 | 0.168835 |
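
`dropna()` has further parameters that control which labels are dropped. The following sketch compares the default with `how`, `thresh`, `subset`, and column-wise dropping on a small frame whose values are invented for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, np.nan],
        "two": [np.nan, np.nan, 6.0, np.nan],
        "three": [9.0, np.nan, 11.0, 12.0],
    },
    index=["a", "b", "c", "d"],
)

# Default (how="any"): drop rows with at least one NA value.
complete_rows = frame.dropna()

# how="all": drop only rows in which every value is NA.
not_empty_rows = frame.dropna(how="all")

# thresh=2: keep rows with at least two non-NA values.
at_least_two = frame.dropna(thresh=2)

# subset: only look at the listed columns when deciding what to drop.
has_three = frame.dropna(subset=["three"])

# axis="columns": drop columns instead of rows (after removing the empty row "b").
complete_columns = frame.drop(index="b").dropna(axis="columns")

print(complete_rows.index.tolist())
print(not_empty_rows.index.tolist())
print(at_least_two.index.tolist())
print(has_three.index.tolist())
print(complete_columns.columns.tolist())
```

All of these calls return a new object and leave `frame` unchanged.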
