## 2.2 Missing Values and Data Type Transformations

### Why is it a problem?
- missing values add ambiguity to an analysis
- Ex. to compute a mean over data with missing values, you have to make assumptions about the missing entries

### Solutions
Depends on several factors: problem domain, % of missing data, others

No magic solution
- drop columns or rows of data with missing values
- **Missing Value Imputation:** replace missing data with substitute values

#### Missing Value Imputation
- replace missing value with fixed value
    - 0 or any other number if col numeric
    - mean or median of existing values for numeric col
    - categorical - mode of existing entries
- take educated guesses using machine/statistical learning models
- use analysis methods specifically designed to handle missing values
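
The fixed-value strategies above can be sketched with pandas. This is a minimal illustration on a made-up frame; the column names `age` and `color` and their values are assumptions, not data from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "color": ["red", "blue", None, "red"],
})

# Numeric column: fill with the mean of the observed values.
age_mean_filled = df["age"].fillna(df["age"].mean())

# Numeric column: fill with the median of the observed values.
age_median_filled = df["age"].fillna(df["age"].median())

# Categorical column: fill with the mode (most frequent observed value).
# mode() can return several values on ties; take the first one here.
color_filled = df["color"].fillna(df["color"].mode().iloc[0])
```

Which statistic to use is a judgment call: the median is less sensitive to outliers than the mean.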
   
### Data Type Transformation
Common to convert categorical data to numeric for analyzing (e.g. artificial neural networks)

### Binary Encoding (*one-hot encoding*) (categorical to numeric)
- common technique
- procedure:
    - categorical data has 'n' categories => create 'n' new binary columns
    - for each row - set all vals to 0 except col that represents the categorical val => 1
    - **Dummy Encoding** 
        - possible (and often better) to create 'n-1' binary cols, b/c the remaining category can be inferred when all n-1 cols for the row are 0
        - Ex. 3 options: red, blue, green. With binary cols for red and blue only, a row where both are 0 must be green
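
A minimal sketch of both encodings using `pd.get_dummies`, assuming a made-up `color` column. Note that `drop_first=True` drops the first category in sorted order (here `blue`), so the baseline category differs from the red/blue example above, but the idea is the same.

```python
import pandas as pd

# Hypothetical categorical column with n = 3 categories.
colors = pd.Series(["red", "blue", "green", "red"], name="color")

# One-hot encoding: n binary columns, exactly one 1 per row.
one_hot = pd.get_dummies(colors, dtype=int)

# Dummy encoding: n-1 binary columns; the dropped category ("blue")
# is implied when every remaining column in the row is 0.
dummy = pd.get_dummies(colors, drop_first=True, dtype=int)
```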
        
### Binning or Discretisation (numeric to categorical)
Transform numerical data into bins or intervals
- **Supervised:** involves the class (or target) col
    - entropy-based binning
- **Unsupervised:** does not involve the class col
    - Equal width binning
        - numeric data sorted and divided into 'n' intervals (bins)
        - width of each bin: w = (max - min) / n
        - equal range bins regardless of frequency
    - Equal Frequency Binning
        - divide data into 'n' bins so that each bin has approx. the same number of cases
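
The two unsupervised schemes can be sketched with `pd.cut` (equal width) and `pd.qcut` (equal frequency). The values are made up, with one outlier to make the difference visible; `labels=False` returns the bin index for each value instead of an interval label.

```python
import pandas as pd

# Hypothetical numeric data with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width binning: 3 bins of width (max - min) / n = (100 - 1) / 3 = 33.
# Most values land in the first bin because the outlier stretches the range.
width_codes = pd.cut(values, bins=3, labels=False)

# Equal-frequency binning: 2 bins with roughly the same number of cases each.
freq_codes = pd.qcut(values, q=2, labels=False)

# Without labels=False, pd.cut returns the interval each value falls into.
width_intervals = pd.cut(values, bins=3)
```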

## 2.4 Readings - Pandas Docs

[Working with missing data - Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)

To make detecting missing values easier (and across different array dtypes), pandas provides the `isna()` and `notna()` functions, which are also methods on Series and DataFrame objects:

[isna docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html#pandas.isna)
- detect missing values for an array like object
- indicates whether values are missing
    - NaN in numeric arrays
    - None or NaN in object arrays
    - NaT in datetimelike arrays

[notna docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html#pandas.notna)
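
A short sketch of `isna()` / `notna()` on the three kinds of arrays listed above. The Series contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series, one per array kind.
numeric = pd.Series([1.0, np.nan, 3.0])
objects = pd.Series(["a", None, np.nan], dtype=object)
dates = pd.Series(pd.to_datetime(["2024-01-01", None]))  # None becomes NaT

# Method form and top-level function form give the same result.
numeric_missing = numeric.isna()
object_missing = pd.isna(objects)
date_missing = dates.isna()

# notna() is the element-wise negation of isna().
numeric_present = numeric.notna()
```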

### Datetimes
NaT represents missing values for datetime64 types


### Inserting missing data
Can insert missing values by simply assigning to containers - actual missing val chosen based on dtype
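
A minimal sketch of how the stored missing value follows the dtype, assuming two made-up Series. Assigning `None` yields `NaN` in a float Series and `NaT` in a datetime Series.

```python
import pandas as pd

# Float Series: assigning None stores NaN and keeps the float dtype.
s = pd.Series([1.0, 2.0, 3.0])
s.loc[0] = None

# Datetime Series: assigning None stores NaT and keeps the datetime dtype.
d = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
d.loc[0] = None
```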

### Calcs with missing data
missing values propagate naturally through arithmetic ops b/w pandas objects

*Ex. add a NaN + 3 => NaN*

Descriptive stats are written to account for missing data
- when summing data, NA values are treated as 0
- if the data are all NA, the sum is 0
- cumulative methods like `cumsum()` and `cumprod()` ignore NA vals by default but preserve them in the resulting arrays
    - can override with `skipna=False`
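
The rules above can be checked on two small made-up Series:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, 30.0])

# Arithmetic: NaN propagates, so the middle element of the result is NaN.
total = a + b

# Sum: NaN is skipped (treated as 0); an all-NA Series sums to 0.
a_sum = a.sum()
all_na_sum = pd.Series([np.nan, np.nan]).sum()

# Cumulative sum: NaN is skipped in the running total but kept in place.
running = a.cumsum()

# With skipna=False, everything from the first NaN onward is NaN.
running_strict = a.cumsum(skipna=False)
```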

### Filling Missing Values `fillna`

[fillna() docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)

### Filling with a Pandas Object
Can fillna using a dict or Series that is alignable

Labels of the dict or index of the Series must match the cols of the frame you wish to fill

Use case: fill a DF with mean of that column

Ex. `data.fillna(data.mean())`, or fill only selected columns (skip A, fill B and C): `data.fillna(data.mean()['B':'C'])`
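
A sketch of the column-mean fill on a made-up frame with columns `A`, `B`, `C` (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 6.0],
    "C": [7.0, 8.0, np.nan],
})

# data.mean() is a Series indexed by column name, so fillna aligns on columns.
filled_all = data.fillna(data.mean())

# Slice the means by label to fill only B and C; A keeps its NaN.
filled_bc = data.fillna(data.mean()["B":"C"])
```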

### Drop axis labels with missing data `dropna`
exclude labels from a data set which refer to missing data

[dropna() docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html#pandas.DataFrame.dropna)

drop rows with missing data: `data.dropna(axis=0)` OR drop cols: `data.dropna(axis=1)`
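
A small sketch of both axes on a made-up frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [4.0, 5.0, 6.0],
})

# axis=0 (default): drop every row that contains at least one missing value.
rows_kept = data.dropna(axis=0)

# axis=1: drop every column that contains at least one missing value.
cols_kept = data.dropna(axis=1)
```

Both calls return a new frame and leave `data` unchanged.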

### Interpolation `interpolate()`

performs linear interpolation at missing data points by default

`data.interpolate()` - `data.interpolate(method='time')` - `data.interpolate(method='values')`

the `method` option makes interpolation index-aware: `time` for a datetime index, `values` for a floating-point index

scipy routines (selected via `method`, require scipy)
- quadratic: time series growing at an increasing rate
- pchip: values approximating a cumulative distribution func.
- akima: filling gaps for smooth plotting
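
A sketch contrasting the default linear method with the index-aware ones, on made-up Series with unevenly spaced indexes. The scipy-backed methods are left out so the example runs with pandas alone.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with an uneven floating-point index.
s = pd.Series([0.0, np.nan, 10.0], index=[0.0, 1.0, 4.0])

# Default (linear): ignores the index and treats points as equally spaced.
by_position = s.interpolate()

# method="values": uses the index values, so the gap at 1.0 sits a quarter
# of the way between 0.0 and 4.0.
by_index = s.interpolate(method="values")

# method="time": the same idea for a datetime index.
ts = pd.Series(
    [0.0, np.nan, 10.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
)
by_time = ts.interpolate(method="time")
```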


### Replacing generic values `replace()`

[replace() docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace)

`Series.replace()` and `DataFrame.replace()` are separate methods

for a Series, can replace a single value or a list of values with another value

`ser.replace(0,5)` or `ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])`

DataFrame - can specify individual values by column:

`df.replace({"a": 0, "b": 5}, 100)`
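
The three calls above, run on made-up data (the contents of `ser` and `df` are assumptions for illustration):

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 3, 4])

# Single value: every 0 becomes 5.
single = ser.replace(0, 5)

# List to list: each value maps to the value at the same position.
mapped = ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])

df = pd.DataFrame({"a": [0, 1, 2], "b": [5, 6, 7]})

# Per-column targets: 0 in column a and 5 in column b both become 100.
by_column = df.replace({"a": 0, "b": 5}, 100)
```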

### String / Regex replacement


Replace the '.' with NaN => `df.replace('.', np.nan)`

Replace '.' together with any surrounding whitespace with NaN (regex) => `df.replace(r"\s*\.\s*", np.nan, regex=True)`

Replace a few values => `df.replace(["a", "."], ["b", np.nan])`

Only search in column 'b' (dict -> dict): `df.replace({"b": "."}, {"b": np.nan})`
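
A sketch comparing exact-match, regex, and per-column replacement on a made-up frame, where `" . "` is a dot padded with spaces:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [0, 1, 2],
    "b": ["x", ".", " . "],
    "c": ["a", "b", "."],
})

# Exact match: only cells equal to "." are replaced; " . " is left alone.
exact = df.replace(".", np.nan)

# Regex: a dot with optional surrounding whitespace also matches " . ".
with_regex = df.replace(r"\s*\.\s*", np.nan, regex=True)

# Dict -> dict: only column b is searched, so the "." in column c stays.
only_b = df.replace({"b": "."}, {"b": np.nan})
```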


### Numeric Replacement
`replace()` with numeric values works similarly to `fillna()`

`df.replace(1.5, np.nan)`