# Pre-Processing Data


## Load libraries

In [1]:
#!pip install pandas
#!pip install numpy
#!pip install matplotlib
#!pip install seaborn
# datetime ships with the Python standard library, so it needs no pip install

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# make plot show up without plt.show()
%matplotlib inline

# plot configurations
plt.rcParams["figure.figsize"] = (10, 8) 
plt.style.use('fivethirtyeight')

# date time library
import datetime as dt

## Helper functions

In [3]:
def missing_cols(df):
    """Print each column that has missing values, with its count and percentage."""
    total = 0
    for col in df.columns:
        missing_vals = df[col].isnull().sum()
        pct = df[col].isnull().mean() * 100
        if missing_vals != 0:
            print(f'{col} => {missing_vals} [{round(pct, 2)}%]')
        total += missing_vals

    if total == 0:
        print("no missing values left 🎉 ")

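def get_memory(df):
    """Print the memory used by the data frame in MB (1 MB = 1,000,000 bytes)."""
    # note: without deep=True, string (object) columns are underestimated
    print(f"{round(df.memory_usage().sum() / 1000000, 2)} MB")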

## Load data

In [4]:
# data from github repo: https://raw.githubusercontent.com/notelai/data/master/rawdata_AB_NYC_2019.csv
url = 'https://raw.githubusercontent.com/notelai/data/master/rawdata_AB_NYC_2019.csv'

df = pd.read_csv(url)

### df shape 


In [91]:
 # (no. of rows, no. of columns)

### summary, data types, first five rows, last five rows, random sample of rows
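
A minimal sketch of the inspection calls this heading lists, together with `shape` from the cell above. The small data frame here is a made-up stand-in so the snippet runs on its own; in the notebook you would call the same methods on the `df` loaded earlier.

```python
import pandas as pd

# Toy stand-in for the Airbnb data (values are invented for illustration).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "price": [100, 80, 150, 60, 200, 90],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Shared room", "Entire home/apt", "Private room"],
})

print(df.shape)                       # (no. of rows, no. of columns)
df.info()                             # summary: columns, non-null counts, dtypes
print(df.dtypes)                      # data type of each column
print(df.head())                      # first 5 rows
print(df.tail())                      # last 5 rows
print(df.sample(3, random_state=0))   # 3 randomly sampled rows
```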

## Rename columns

In [92]:
# show columns of dataframe


In [93]:
 # turn into list

### string methods

In [94]:
# replace whitespace with underscore '_'


# lower case column names
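
One way to do both steps with the string methods available on `df.columns`. The column names below are invented for the example and are not the names in the real dataset.

```python
import pandas as pd

# Empty frame with hypothetical, messy column names.
df = pd.DataFrame(columns=["Host Id", "Room Type", "price"])

# replace whitespace with underscore '_', then lower case the names
df.columns = df.columns.str.replace(" ", "_").str.lower()

print(df.columns.tolist())
```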



### pandas rename

In [95]:
# {'original_name' : 'new_name'}

# only returning output


### inplace = True

- updates the data frame itself instead of returning an output
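
A short sketch of the difference, using `DataFrame.rename` with a `{'original_name': 'new_name'}` mapping. The mapping itself is an assumption for illustration (the notebook's actual renames are not shown).

```python
import pandas as pd

df = pd.DataFrame({"name": ["Cozy loft"], "number_of_reviews": [9]})

# {'original_name': 'new_name'} (hypothetical mapping)
new_names = {"name": "listing_name", "number_of_reviews": "total_reviews"}

# Without inplace, rename returns a new data frame and leaves df untouched.
renamed = df.rename(columns=new_names)
columns_before = df.columns.tolist()

# With inplace=True, df itself is modified and the call returns None.
result = df.rename(columns=new_names, inplace=True)
columns_after = df.columns.tolist()

print(columns_before, columns_after, result)
```

Reassigning the returned frame (`df = df.rename(columns=new_names)`) has the same effect as `inplace=True` and is often easier to chain.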

## Remove labels from values

- a dollar sign (or a similar label) inside the values turns the column into a mixture of strings and numbers
- this causes problems when filling in missing values or converting data types
- remove the dollar sign and any similar labels before converting



### Converting the `price` column to numeric
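
A minimal sketch of stripping the label and converting, assuming the raw prices look like `"$1,200"` (the exact raw format is an assumption). `errors="coerce"` turns anything that still cannot be parsed into `NaN` instead of raising.

```python
import pandas as pd

# Toy price column with a dollar sign, a thousands separator and a missing value.
df = pd.DataFrame({"price": ["$149", "$1,200", "$89", None]})

# Remove the literal characters (regex=False so "$" is not treated as a pattern).
cleaned = (
    df["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Convert the cleaned strings to numbers.
df["price"] = pd.to_numeric(cleaned, errors="coerce")

print(df["price"])
```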

### Convert to category

- `neighbourhood_group`
- `neighbourhood`
- `room_type`

Guidelines for conversion
- convert to reduce memory usage and speed up operations on categorical data
- make sure the data is clean before converting it, otherwise each misspelled or differently cased value becomes its own category
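
The memory effect can be checked directly. This sketch uses a toy frame with repeated values; the exact savings on the real data will differ.

```python
import pandas as pd

# 1,000 rows with only a handful of distinct values per column.
df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn", "Queens"] * 250,
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"] * 250,
})

before = df.memory_usage(deep=True).sum()

for col in ["neighbourhood_group", "room_type"]:
    df[col] = df[col].astype("category")

after = df.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
print(df["room_type"].cat.categories.tolist())
```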


### Convert to datetime 
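
A sketch with `pd.to_datetime`, assuming the date column is `last_review` and is stored as `YYYY-MM-DD` strings (both are assumptions about the raw file).

```python
import pandas as pd

df = pd.DataFrame({"last_review": ["2019-05-21", "2018-10-19", None]})

# Missing or unparseable dates become NaT instead of raising an error.
df["last_review"] = pd.to_datetime(
    df["last_review"], format="%Y-%m-%d", errors="coerce"
)

print(df.dtypes)
```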

### Changing numeric types

ex: changing int types (int8 | int16 | int32 | int64)

- the number in the type name is its width in bits
- int8 can store integers from -128 to 127.
- int16 can store integers from -32768 to 32767.
- int32 can store integers from -2147483648 to 2147483647.
- int64 can store integers from -9223372036854775808 to 9223372036854775807.
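
The ranges can be read from `np.iinfo`, and a column can be narrowed with `astype` or with `pd.to_numeric(..., downcast="integer")`. The `minimum_nights` column and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Print the range of each integer type.
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

df = pd.DataFrame({
    "availability_365": [0, 120, 365],
    "minimum_nights": [1, 3, 30],
})

# 365 does not fit in int8 (max 127) but fits in int16.
df["availability_365"] = df["availability_365"].astype("int16")

# Or let pandas pick the smallest integer type that holds the data.
df["minimum_nights"] = pd.to_numeric(df["minimum_nights"], downcast="integer")

print(df.dtypes)
```

Choosing a type that is too small for later values will overflow or fail, so check the column's minimum and maximum first.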

## Missing data

### Checking which columns have missing data

### summing the missing values

### percentage of missing data
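
The three checks above can be sketched with `isnull()` on a toy frame (the values are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "listing_name": ["Cozy loft", None, "Sunny room", "Studio"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [100, 80, 150, 60],
})

has_missing = df.isnull().any()          # which columns have missing data
missing_counts = df.isnull().sum()       # number of missing values per column
missing_pct = df.isnull().mean() * 100   # percentage of missing values per column

print(has_missing)
print(missing_counts)
print(missing_pct)
```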

### How to deal with missing data?

Dealing with missing data is not a simple task. You need to consider why the data is missing in the first place, and you need domain knowledge to decide what to impute.

There is also no fixed threshold for what percentage of missing data is acceptable; it depends on the data.

If you handle it badly, you will introduce bias into your data.


#### Techniques 
1. Drop the feature (column)
1. Drop the rows
1. Impute missing values (manually or automatically)


#### Dropping columns with missing values

- usually the worst strategy, unless the column has a lot of missing data (over 80 or 90%) or the feature is not useful
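
A sketch of dropping columns by their share of missing values. The `license` column and the 80% threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [100, 80, 150, 60, 200, 90],
    # hypothetical column with 5 of 6 values missing (about 83%)
    "license": [np.nan, np.nan, np.nan, np.nan, np.nan, "A1"],
    "reviews_per_month": [0.5, np.nan, 0.1, 1.2, 2.0, 0.7],
})

threshold = 0.8                       # drop columns with more than 80% missing
missing_share = df.isnull().mean()    # fraction of missing values per column
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

df = df.drop(columns=cols_to_drop)

print(cols_to_drop)
print(df.columns.tolist())
```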

#### Remove rows with missing values
- dropping a row also throws away its values in every other column, so you lose even more information; not the best method
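
A sketch with `dropna`. Passing `subset` limits the check to the columns you actually need, which keeps more rows. The data is invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "listing_name": ["Cozy loft", None, "Sunny room", "Studio"],
    "price": [100, 80, 150, 60],
    "reviews_per_month": [0.5, 0.3, np.nan, 1.2],
})

# Drop every row that has at least one missing value in any column.
all_complete = df.dropna()

# Drop only rows where listing_name is missing.
named_only = df.dropna(subset=["listing_name"])

print(len(df), len(all_complete), len(named_only))
```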

#### Impute missing values


- A constant value related to the data, such as 0 for `number_of_reviews`, or "None" for `listing_name`.
- The value before or after the data point (backward fill, forward fill)
- Summary statistics such as mean, median or mode value for the column.
- A value estimated by algorithms or ML models like KNN.
 

##### Imputing manually with pandas fillna
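
A minimal sketch of `fillna` with a constant and with a summary statistic. Which value is appropriate for which column is a judgment call; the choices below are only examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "listing_name": ["Cozy loft", None, "Sunny room"],
    "number_of_reviews": [9.0, np.nan, 0.0],
    "price": [100.0, np.nan, 150.0],
})

# Constant values related to the data.
df["listing_name"] = df["listing_name"].fillna("None")
df["number_of_reviews"] = df["number_of_reviews"].fillna(0)

# Summary statistic: the median of the known prices.
df["price"] = df["price"].fillna(df["price"].median())

print(df)
```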

##### Imputing with bfill and ffill

bfill
- bfill stands for backward fill
- it fills a missing value with the next valid value after it (the later value is carried backwards)

ffill
- ffill stands for forward fill
- it fills a missing value with the last valid value before it (the earlier value is carried forwards)
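
Both directions on a small series. Note that a gap at the very end has nothing after it, so `bfill` leaves it missing (and likewise `ffill` for a gap at the start).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # carry the previous valid value forwards
backward = s.bfill()   # carry the next valid value backwards

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward}))
```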


## Explode Date column
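
Assuming "explode" here means splitting a datetime column into separate year, month and day columns, a sketch with the `.dt` accessor could look like this. The `last_review` column name and the new column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"last_review": pd.to_datetime(["2019-05-21", "2018-10-19"])})

# The .dt accessor only works on datetime columns, so convert first if needed.
df["last_review_year"] = df["last_review"].dt.year
df["last_review_month"] = df["last_review"].dt.month
df["last_review_day"] = df["last_review"].dt.day
df["last_review_weekday"] = df["last_review"].dt.day_name()

print(df)
```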

## Data Inconsistencies

### out of range data

#### Are the values in the column `availability_365` between 0 and 365?

##### Assert keyword
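
`assert` raises an `AssertionError` when its condition is false, which makes it a cheap sanity check. The sketch below uses an invented out-of-range value and caps it with `clip`; capping is just one possible fix and may not suit the real data.

```python
import pandas as pd

df = pd.DataFrame({"availability_365": [0, 120, 365, 400]})

in_range = df["availability_365"].between(0, 365)

try:
    assert in_range.all(), "availability_365 has values outside 0-365"
except AssertionError as err:
    print(f"check failed: {err}")

# One possible fix: cap the values to the valid range.
df["availability_365"] = df["availability_365"].clip(lower=0, upper=365)

# The same check now passes silently.
assert df["availability_365"].between(0, 365).all()
```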

#### Are lat and long coords valid coordinates?

- The latitude must be a number between -90 and 90 and the longitude between -180 and 180

Since you know the latitude and longitude coordinates are for New York, you can also check whether they fall within NYC itself.
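
Both checks can be sketched with `between`. The NYC bounding box below is a rough approximation chosen for illustration, not an official boundary, and a box will also accept points just outside the city limits.

```python
import pandas as pd

df = pd.DataFrame({
    "latitude": [40.7128, 40.6782, 95.0],     # last value is invalid on purpose
    "longitude": [-74.0060, -73.9442, -73.9],
})

# Valid coordinates anywhere on Earth.
valid = df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)

# Rough NYC bounding box (approximate, assumed values).
in_nyc = df["latitude"].between(40.4, 41.0) & df["longitude"].between(-74.3, -73.6)

print(df[~valid])     # rows with impossible coordinates
print(df[~in_nyc])    # rows outside the rough NYC box
```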

### Categorical data inconsistency

- caused by human error (misspellings, the same category written in different cases)

These inconsistencies are usually not easy to deal with, since there can be hundreds of categories; the `neighbourhood` column, for example, has over 200 distinct values. There are two ways to deal with this issue:

1. **Preprocess the text (lowercase, strip whitespace)**
1. Use fuzzy matching to find similar words, and replace them. 
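
Both steps can be sketched as follows. For the fuzzy matching this example uses `difflib.get_close_matches` from the standard library as a stand-in; the notebook may use a dedicated fuzzy-matching package instead. The list of valid categories, the `closest` helper and the 0.8 cutoff are assumptions.

```python
import difflib

import pandas as pd

s = pd.Series(["Manhattan", " manhattan", "BROOKLYN ", "Brookyln", "Queens"])

# Step 1: preprocess the text (strip whitespace, lowercase).
cleaned = s.str.strip().str.lower()

# Step 2: map each value to its closest valid category, if one is close enough.
valid_categories = ["manhattan", "brooklyn", "queens"]

def closest(value, choices=valid_categories, cutoff=0.8):
    """Return the best fuzzy match for value, or value itself if none is close."""
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

fixed = cleaned.map(closest)

print(fixed.value_counts())
```

Review the replacements before applying them to real data, because a low cutoff can merge categories that are genuinely different.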

### Duplicate rows
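
A sketch with `duplicated` and `drop_duplicates` on invented data. By default a row counts as a duplicate only if every column matches an earlier row; pass `subset=[...]` to compare on selected columns.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [100, 80, 80, 150],
})

dupes = df.duplicated()      # True for each repeat of an earlier row
n_dupes = int(dupes.sum())

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(df)
```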

## Outliers

What?
- a data point that is far from the other observations in our data
- it arises from errors in data collection or from the influence of other factors on the data
- when outliers indicate erroneous or abnormal data, we can either remove them or correct them

How to detect?
- with boxplots and histograms
- statistical methods like IQR, skewness, etc.
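
The IQR method can be sketched as follows: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged. The prices are invented, and 1.5 is the conventional multiplier rather than a fixed rule.

```python
import pandas as pd

prices = pd.Series([60, 80, 90, 100, 120, 150, 5000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]

print(f"IQR = {iqr}, bounds = [{lower}, {upper}]")
print(outliers)
```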

> To choose the best way to handle outliers, one must have good domain knowledge and information about where the data come from and what they mean; it also depends on what analysis one is planning to perform.


### Continuous data

#### Plotting a histogram
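
A minimal matplotlib sketch on invented prices. With `%matplotlib inline` the figure renders in the notebook; in a plain script you would add `plt.show()`.

```python
import matplotlib.pyplot as plt
import pandas as pd

prices = pd.Series([60, 80, 90, 100, 120, 150, 5000], name="price")

fig, ax = plt.subplots()
# A single extreme value stretches the x-axis and squeezes the rest into one bin.
counts, bin_edges, _ = ax.hist(prices, bins=20)
ax.set_xlabel("price")
ax.set_ylabel("number of listings")
```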

#### Plotting a boxplot
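
A boxplot draws points beyond 1.5 × IQR from the box as individual markers ("fliers"), so outliers stand out. A sketch on invented prices:

```python
import matplotlib.pyplot as plt
import pandas as pd

prices = pd.Series([60, 80, 90, 100, 120, 150, 5000], name="price")

fig, ax = plt.subplots()
result = ax.boxplot(prices.to_numpy())
ax.set_ylabel("price")

# The values matplotlib drew as outlier markers.
fliers = list(result["fliers"][0].get_ydata())
print(fliers)
```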

### Categorical data

#### Plotting a bar plot
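
For categorical data, a bar plot of `value_counts()` shows rare or unexpected categories. A sketch on invented room types:

```python
import matplotlib.pyplot as plt
import pandas as pd

room_type = pd.Series([
    "Private room", "Entire home/apt", "Private room",
    "Shared room", "Entire home/apt", "Private room",
])

counts = room_type.value_counts()

fig, ax = plt.subplots()
ax.bar(counts.index.tolist(), counts.to_numpy())
ax.set_xlabel("room_type")
ax.set_ylabel("number of listings")
```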



## Save final cleaned dataset
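
A sketch of writing the result with `to_csv`. The file name and the temporary directory are placeholders; `index=False` keeps the row index out of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "price": [100, 80]})

# Hypothetical output location; replace with your own path.
out_path = Path(tempfile.gettempdir()) / "AB_NYC_2019_cleaned.csv"
df.to_csv(out_path, index=False)

# Read it back to confirm the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

CSV stores plain text, so `category` and datetime dtypes are not preserved and have to be re-applied after loading.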