## Introduction to Missing Data 

### Trade-Offs in Missing Data Conventions


### Missing Data in Pandas




### None: Pythonic missing data


### NaN: Missing numerical data



In [None]:
import numpy as np
vals2 = np.array([1, np.nan, 3, 4])
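As a minimal sketch of how `NaN` behaves once it is inside a NumPy array, the cell below reuses `vals2` and compares ordinary aggregations with NumPy's NaN-aware variants. The variable names `total` and `safe_total` are only illustrative.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

# NaN is a floating-point value, so the array keeps a native float dtype
print(vals2.dtype)

# Any aggregation that touches NaN propagates it to the result
total = vals2.sum()
print(total)

# The nan-prefixed functions ignore missing entries instead
safe_total = np.nansum(vals2)
print(safe_total, np.nanmin(vals2), np.nanmax(vals2))
```
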

### NaN and None in Pandas



In [67]:
import pandas as pd
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64


<table>
<thead><tr>
<th>Typeclass</th>
<th>Conversion When Storing NAs</th>
<th>NA Sentinel Value</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>floating</code></td>
<td>No change</td>
<td><code>np.nan</code></td>
</tr>
<tr>
<td><code>object</code></td>
<td>No change</td>
<td><code>None</code> or <code>np.nan</code></td>
</tr>
<tr>
<td><code>integer</code></td>
<td>Cast to <code>float64</code></td>
<td><code>np.nan</code></td>
</tr>
<tr>
<td><code>boolean</code></td>
<td>Cast to <code>object</code></td>
<td><code>None</code> or <code>np.nan</code></td>
</tr>
</tbody>
</table>


### Operating on Null Values



- `isnull()`: Generate a boolean mask indicating missing values
- `notnull()`: Opposite of `isnull()`; generate a boolean mask indicating non-missing values
- `dropna()`: Return a filtered version of the data
- `fillna()`: Return a copy of the data with missing values filled or imputed
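The four methods listed above can be tried side by side on a small Series. This is only a quick sketch: the Series `s` and the fill value `'missing'` are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative Series with two kinds of missing markers
s = pd.Series([1, np.nan, 'hello', None])

mask = s.isnull()             # True where a value is missing
present = s.notnull()         # the inverse mask
dropped = s.dropna()          # new Series without the missing entries
filled = s.fillna('missing')  # new Series with the gaps replaced

print(mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

Each call returns a new object, so `s` itself still contains its two missing entries afterwards.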



#### Detecting null values



In [72]:
data = pd.Series([1, np.nan, 'hello', None])
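A short sketch of how the boolean masks are typically used once they exist: as an index to keep only the present values, and as something to sum in order to count the gaps. The names `non_null` and `n_missing` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

# A boolean mask can be passed directly as an index
non_null = data[data.notnull()]
print(non_null)

# Summing a mask counts the True entries, i.e. the missing values
n_missing = data.isnull().sum()
print(n_missing)
```
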

#### Dropping null values


In [77]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
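The sketch below applies `dropna()` to the frame above along both axes, then uses the `how` and `thresh` parameters on a copy that has an extra all-NaN column. The copy `wide` and its column `3` are added purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Default: drop every row that contains at least one missing value
rows_kept = df.dropna()

# Drop every column that contains at least one missing value
cols_kept = df.dropna(axis='columns')

# Illustrative copy with one column that is entirely missing
wide = df.copy()
wide[3] = np.nan

# how='all' removes only columns where every value is missing
all_nan_removed = wide.dropna(axis='columns', how='all')

# thresh=3 keeps rows with at least three non-missing values
thresh_rows = wide.dropna(axis='rows', thresh=3)

print(rows_kept)
print(cols_kept)
print(all_nan_removed)
print(thresh_rows)
```
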

#### Filling null values



In [85]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
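Three common ways to fill the gaps in this Series are sketched below: a constant, the previous valid value (forward fill), and the next valid value (backward fill). The result names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Replace every missing entry with a single constant
zero_filled = data.fillna(0)

# Propagate the previous valid value forward
forward = data.ffill()

# Propagate the next valid value backward
backward = data.bfill()

print(zero_filled.tolist())
print(forward.tolist())
print(backward.tolist())
```
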

## Dropping Missing Values

## Filling Missing Values

## Removing Duplicates

In [3]:
zenbook_model = 'ZenBook UX305CA-UBM1'
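One way the model name above might be used is to inspect the rows for that model and then remove exact duplicates. The sketch below assumes a hypothetical `laptops` frame with made-up columns `model_name` and `ram`; the real dataset is not shown in this notebook.

```python
import pandas as pd

zenbook_model = 'ZenBook UX305CA-UBM1'

# Hypothetical toy frame standing in for the laptops dataset
laptops = pd.DataFrame({
    'model_name': [zenbook_model, zenbook_model, 'MacBook Pro'],
    'ram': ['8GB', '8GB', '16GB'],
})

# Inspect all rows for the model of interest
zenbooks = laptops[laptops['model_name'] == zenbook_model]
print(zenbooks)

# duplicated() flags every repeat of an earlier, identical row
dup_mask = laptops.duplicated()
print(dup_mask.tolist())

# drop_duplicates() keeps the first occurrence of each row
deduped = laptops.drop_duplicates()
print(len(deduped))
```
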

## Replacing Values

## Dropping Columns 

## Exercise

### Convert the price_euros column to a numeric dtype.

### Extract the screen resolution from the screen column.

### Extract the processor speed from the cpu column.

## Save clean data to CSV file

## Analysis

### Are laptops made by Apple more expensive than those made by other manufacturers?


### What is the best value laptop with a screen size of 15" or more?
            

### Which laptop has the most RAM?

## Working With Missing Data

### Introduction

In [3]:
happiness2015 = pd.read_csv('data/wh_2015.csv') 
happiness2016 = pd.read_csv('data/wh_2016.csv') 
happiness2017 = pd.read_csv('data/wh_2017.csv')

In [None]:
shape_2015 = happiness2015.shape
shape_2016 = happiness2016.shape
shape_2017 = happiness2017.shape

In [None]:
shape_2015

In [None]:
shape_2016

In [None]:
shape_2017

### Identifying Missing Values

### Correcting Data Cleaning Errors that Result in Missing Values

### Visualizing Missing Data

### Using Data From Additional Sources to Fill in Missing Values

In [None]:
regions2015 = happiness2015[['COUNTRY', 'REGION']].copy()
regions2016 = happiness2016[['COUNTRY', 'REGION']].copy()
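A possible way to use these two lookup frames is to stack them, drop repeated country/region pairs, and left-merge the result onto data that lacks a region. The sketch below uses tiny hypothetical stand-ins for `regions2015`, `regions2016` and `happiness2017`; the names `regions` and `filled` are illustrative.

```python
import pandas as pd

# Hypothetical stand-ins for the real frames
regions2015 = pd.DataFrame({
    'COUNTRY': ['Norway', 'Chile'],
    'REGION': ['Western Europe', 'Latin America and Caribbean'],
})
regions2016 = pd.DataFrame({
    'COUNTRY': ['Norway', 'Belize'],
    'REGION': ['Western Europe', 'Latin America and Caribbean'],
})
happiness2017 = pd.DataFrame({
    'COUNTRY': ['Norway', 'Chile', 'Belize'],
    'YEAR': [2017, 2017, 2017],
})

# One lookup table with a single row per country/region pair
regions = pd.concat([regions2015, regions2016], ignore_index=True).drop_duplicates()

# A left merge keeps every 2017 row and attaches the region where known
filled = happiness2017.merge(regions, on='COUNTRY', how='left')
print(filled)
```
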


### Identifying Duplicate Values

### Correcting Duplicate Values

### Handle Missing Values by Dropping Columns

In [None]:
columns_to_drop = ['LOWER CONFIDENCE INTERVAL', 'STANDARD ERROR', 
                   'UPPER CONFIDENCE INTERVAL', 'WHISKER HIGH', 
                   'WHISKER LOW']
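The list above can be passed to `DataFrame.drop`. The sketch below uses a hypothetical `combined` frame with made-up values that contains only some of those columns, so `errors='ignore'` is used to skip labels that are absent from the toy data.

```python
import numpy as np
import pandas as pd

columns_to_drop = ['LOWER CONFIDENCE INTERVAL', 'STANDARD ERROR',
                   'UPPER CONFIDENCE INTERVAL', 'WHISKER HIGH',
                   'WHISKER LOW']

# Hypothetical frame with made-up values
combined = pd.DataFrame({
    'COUNTRY': ['A', 'B'],
    'HAPPINESS SCORE': [7.5, 6.1],
    'STANDARD ERROR': [0.03, np.nan],
    'WHISKER HIGH': [np.nan, 6.2],
})

# drop() returns a new frame; errors='ignore' tolerates missing labels
trimmed = combined.drop(columns=columns_to_drop, errors='ignore')
print(trimmed.columns.tolist())
```
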

### Analyzing Missing Data

### Handling Missing Values with Imputation

### Dropping Rows

## Identifying Hidden Missing Data

### Example: Happiness 2015

In [12]:
happiness2015 = pd.read_csv('data/wh_2015_special.csv')
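Hidden missing values are often placeholder strings that pandas does not recognize as null by default. The sketch below assumes, purely for illustration, that the placeholder is `'.'` and reads a small in-memory CSV instead of the real file; the actual marker used in `wh_2015_special.csv` is not shown here.

```python
import io
import pandas as pd

# Hypothetical CSV text where '.' stands for a missing region
raw = "COUNTRY,REGION\nNorway,Western Europe\nChile,.\n"

# Read naively: the placeholder is kept as an ordinary string
naive = pd.read_csv(io.StringIO(raw))
print(naive['REGION'].isnull().sum())

# Tell pandas which extra strings should be treated as missing
aware = pd.read_csv(io.StringIO(raw), na_values=['.'])
print(aware['REGION'].isnull().sum())
```
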

### Example: Diabetes

In [None]:
diabetes = pd.read_csv('data/pima-indians-diabetes_data.csv')
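In numeric data, missing values are sometimes hidden as physically implausible zeros. The sketch below assumes hypothetical column names `Glucose` and `BMI` with made-up values, and converts zeros in those columns to `NaN` so that the usual null-handling methods can see them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diabetes data (made-up values)
diabetes = pd.DataFrame({
    'Glucose': [110, 0, 150],
    'BMI': [30.5, 25.0, 0.0],
})

# Columns where zero is assumed to mean "not recorded"
zero_as_missing = ['Glucose', 'BMI']

cleaned = diabetes.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

print(cleaned.isnull().sum())
```
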

#### Analyzing missingness percentage
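A simple way to get the share of missing values per column is to take the mean of the null mask, since `True` counts as 1. The frame below is a hypothetical example with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with made-up values
frame = pd.DataFrame({
    'Glucose': [110, np.nan, 150, 90],
    'BMI': [30.5, 25.0, np.nan, np.nan],
})

# Mean of a boolean mask is the fraction of True values per column
missing_pct = frame.isnull().mean() * 100
print(missing_pct)
```
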

## Advanced Visualization of Missing Data

In [None]:
# Import missingno as msno
import missingno as msno
import matplotlib.pyplot as plt
%matplotlib inline

### Missingness Patterns

## Handle Missing Values

### Dropping Rows

### Imputation Techniques

#### Mean & median imputation
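A minimal sketch of mean and median imputation using plain pandas `fillna`, on a hypothetical one-column frame with made-up values. The original notebook may use a different tool for this step (for example a dedicated imputer class); the names `diabetes_mean` and `diabetes_median` are reused here only to match the later visualization cell.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diabetes data (made-up values)
diabetes = pd.DataFrame({'Glucose': [100.0, np.nan, 120.0, 200.0]})

# Fill each column's gaps with that column's mean
diabetes_mean = diabetes.fillna(diabetes.mean())

# Fill each column's gaps with that column's median
diabetes_median = diabetes.fillna(diabetes.median())

print(diabetes_mean['Glucose'].tolist())
print(diabetes_median['Glucose'].tolist())
```

The mean is pulled toward the outlying value 200, while the median is not, which is the usual reason to prefer the median for skewed columns.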


#### Mode and constant imputation
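Mode (most frequent value) and constant imputation can be sketched the same way with pandas alone. The frame, its column name and the constant `0` are hypothetical choices for illustration, and the original notebook may rely on a different library for this step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diabetes data (made-up values)
diabetes = pd.DataFrame({'Pregnancies': [1.0, 1.0, np.nan, 3.0]})

# mode() returns a frame; its first row holds the most frequent value per column
diabetes_mode = diabetes.fillna(diabetes.mode().iloc[0])

# Fill every gap with one fixed value
diabetes_constant = diabetes.fillna(0)

print(diabetes_mode['Pregnancies'].tolist())
print(diabetes_constant['Pregnancies'].tolist())
```
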

#### Visualize imputations

In [None]:



imputations = {'Mean Imputation': diabetes_mean,
               'Median Imputation': diabetes_median,
               'Most Frequent Imputation': diabetes_mode,
               'Constant Imputation': diabetes_constant}

