# Module 2 Data Wrangling

## The pre-processing phase of data analysis

Also known as: the data cleaning process<br>
Some common tasks:
+ Handling **missing values** in data
+ Data Formatting
+ Data Normalization (centering/scaling)
+ Data Binning: **Grouping** data values into bins
+ Converting **categorical variables** into **numerical quantitative variables**

### Simple Dataframe Operations
+ To access a specific column: df["column name"] (df is the dataframe)
+ To apply an operation to every entry of a column: df["column name"] = df["column name"] + 1 (e.g. add 1 to each value)
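
The two operations above can be sketched on a toy dataframe. The column names and values here are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions.
df = pd.DataFrame({"price": [100, 200, 300], "doors": [2, 4, 4]})

# Access a single column (this returns a pandas Series).
prices = df["price"]

# Apply an operation to every entry of the column and store the result back.
df["price"] = df["price"] + 1

print(df["price"].tolist())
```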

## Dealing with Missing values

Missing values are usually represented as "?", "N/A", 0 or just a blank cell.<br>
How to deal with them:
+ 1st Priority: Check the source
+ 2nd Priority: Drop the missing values
    + Drop the variable
    + Drop the data entry
+ 3rd Priority: Replace the missing values
    + Replace it with an average (of similar datapoints) => numerical variables
    + Replace it by frequency => categorical variables
    + Replace it using another function (e.g. a value estimated from other variables)
+ 4th Priority: Just leave it!
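
Before choosing one of the options above, it usually helps to find out where the missing values are. A minimal sketch, assuming the dataset marks missing entries with "?" (the toy columns below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" and None both stand for a missing value.
df = pd.DataFrame({
    "horsepower": ["111", "?", "154"],
    "fuel": ["gas", "diesel", None],
})

# Convert the "?" placeholder into NaN so pandas treats it as missing.
df = df.replace("?", np.nan)

# Count the missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)
```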

### Drop missing values in Python:
dataframe.dropna(subset=["column name"], axis=number, inplace=True) <br>

+ axis=0 drops the entire row, axis=1 drops the entire column
+ inplace=True: the operation is performed on the original object and the return value is None
+ inplace=False (the default): the operation is performed on a copy of the original object and a new object is returned
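
A short sketch of how these options behave, using a hypothetical toy dataframe (the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in two of the three columns.
df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0],
    "make": ["audi", "bmw", None],
    "doors": [2, 4, 4],
})

# inplace=False (the default): a new dataframe is returned, df is unchanged.
rows_with_price = df.dropna(subset=["price"], axis=0)

# axis=1: drop every column that contains at least one missing value.
complete_columns = df.dropna(axis=1)

# inplace=True: df itself is modified and the call returns None.
result = df.dropna(subset=["price"], axis=0, inplace=True)

print(len(rows_with_price), list(complete_columns.columns), result, len(df))
```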

### Replace missing values in Python:
dataframe.replace(missing_value, new_value)

For example, if the missing value is NaN and you would like to replace it with the average value of that column (assign the result back, because replace returns a new object by default):
+ mean = df["column name"].mean()
+ df["column name"] = df["column name"].replace(np.nan, mean)
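
The same idea as a runnable sketch, extended to the "replace by frequency" option for a categorical column. The data is hypothetical, and fillna is used for the categorical column as an alternative pandas call that fills NaN directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative assumptions.
df = pd.DataFrame({
    "normalized-losses": [100.0, np.nan, 140.0, 120.0],
    "num-of-doors": ["four", "two", "four", np.nan],
})

# Numerical variable: replace NaN with the column mean (mean() skips NaN).
mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)

# Categorical variable: replace NaN with the most frequent category.
most_frequent = df["num-of-doors"].value_counts().idxmax()
df["num-of-doors"] = df["num-of-doors"].fillna(most_frequent)

print(df)
```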

## Data Formatting

Some popular errors to fix:
+ Wrong data type is assigned to a feature
+ Incorrect unit of measurement
+ Unclear column names
+ Inconsistent representations of the same value (e.g. "N.Y." and "NY")

Change values of an entire column: df["column name"] = ...<br>
Rename column(s): df.rename(columns={"old name": "new name"}, inplace=True)<br>

Correcting data types:
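+ dataframe.dtypes to identify the data type of each column (dtypes is an attribute, not a method)
+ dataframe.astype() to convert the data type, e.g. df["column name"].astype("float")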
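
The formatting fixes above can be sketched together. The toy data, the column names, and the approximate conversion 235 / mpg (miles per gallon to litres per 100 km) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: "price" was read in as strings by mistake.
df = pd.DataFrame({
    "city-mpg": [21, 19, 24],
    "price": ["13495", "16500", "13950"],
})

# Inspect the data type of each column.
print(df.dtypes)

# Correct the data type of "price".
df["price"] = df["price"].astype("float")

# Change the unit of a whole column (approximate: L/100km = 235 / mpg),
# then rename the column so the name matches the new unit.
df["city-mpg"] = 235 / df["city-mpg"]
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)

print(df.dtypes)
```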

## Data Normalization

Bring feature values with different ranges onto a uniform scale:
+ Similar value range => usually from 0 to 1
+ Similar intrinsic influence on the analytical model (e.g. same scale, same unit of measurement, same bin...)

Methods of normalizing data
1. Simple feature scaling: x_new = x_old / x_max
2. Min-Max: x_new = (x_old - x_min) / (x_max - x_min)
3. Z-score: x_new = (x_old - mean) / standard deviation 
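
The three formulas translate directly into pandas column arithmetic. A minimal sketch on a hypothetical "length" column; note that Series.std() computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical toy data; the column name is an illustrative assumption.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
x = df["length"]

# 1. Simple feature scaling: divide by the maximum.
simple = x / x.max()

# 2. Min-Max: shift by the minimum, divide by the range.
min_max = (x - x.min()) / (x.max() - x.min())

# 3. Z-score: subtract the mean, divide by the standard deviation.
z_score = (x - x.mean()) / x.std()

print(simple.tolist(), min_max.tolist(), z_score.tolist())
```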

## Data Binning

+ Binning: grouping a set of numerical values into a set of "bins"
+ Converts numeric variables into categorical variables

Take "price" as an example:
+ bins = np.linspace(min(df["price"]), max(df["price"]), 4)
+ group_names = ["Low","Medium","High"]
+ df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
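
A self-contained version of the "price" example, with made-up prices. Three bins need four edges, which is why np.linspace is asked for 4 points.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices, for illustration only.
df = pd.DataFrame({"price": [5000, 12000, 20000, 33000, 45000]})

# Four equally spaced edges between the minimum and maximum give three bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum price inside the first bin.
df["price-binned"] = pd.cut(
    df["price"], bins, labels=group_names, include_lowest=True
)

print(df)
```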

## Turning Categorical variables into quantitative variables

### Problem: Most statistical models cannot take objects/strings as input

### Solution:
+ Add dummy variables for each unique category
+ Assign 0 or 1 in each category

=> One-hot encoding<br>
=> pd.get_dummies(df["column name"])
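
A minimal sketch of one-hot encoding on a hypothetical "fuel" column. Recent pandas versions return boolean dummy columns by default, so dtype=int is passed here to get 0/1 values.

```python
import pandas as pd

# Hypothetical toy data; the column name is an illustrative assumption.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One dummy column per unique category, holding 0 or 1.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns to the original dataframe.
df = pd.concat([df, dummies], axis=1)

print(df)
```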