## Data Cleaning
Data cleaning is the process of removing or correcting garbage, incorrect, duplicate, corrupted, or incomplete records in a dataset. It is part of data preparation, and its goal is a reliable, uniform, and standardized dataset.

There are four ways you can handle missing values during data cleaning:

1.   Drop the missing values
    ```
    df.dropna()
    ```
2.   Replace the missing values
    ```
    import numpy as np

    df.replace({np.nan: 1.00})
    ```
3.   Replace each NaN with a scalar value
    ```
    df.fillna(12)
    ```
4.   Fill the missing values forward or backward
    ```
    df.bfill()  # use df.ffill() to fill forward instead
    ```
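
The four options above can be compared side by side on a small frame. This is a minimal sketch, assuming a toy DataFrame `df` with hypothetical columns `age` and `salary`; each call returns a new frame and leaves `df` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "salary": [50000.0, 62000.0, np.nan, 58000.0],
})

dropped = df.dropna()                 # 1. keep only the complete rows
replaced = df.replace({np.nan: 1.0})  # 2. swap NaN for a chosen value
filled = df.fillna(12)                # 3. fill every NaN with a scalar
back = df.bfill()                     # 4. copy the next valid value backward
forward = df.ffill()                  #    or carry the previous value forward

print(dropped)
print(back)
```

Dropping rows is the simplest choice but discards data, so it tends to suit cases where only a small share of rows is incomplete.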



## Mean/mode/median imputation
We can also impute missing values with the mean, median, or mode. For numerical data, compute the column's mean or median and use it to replace the missing values. For categorical (non-numerical) data, use the column's mode instead.

```
salary_mean = df['salary'].mean()  # NaN values are skipped by default
df['salary'] = df['salary'].fillna(salary_mean)
```
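
To show both cases together, here is a small sketch that fills a numerical column with its mean and a categorical column with its mode. The frame and the column names `salary` and `dept` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing number and one missing category.
df = pd.DataFrame({
    "salary": [40000.0, np.nan, 60000.0, 50000.0],
    "dept": ["ops", "eng", None, "eng"],
})

# Numerical column: fill with the mean (use .median() for skewed data).
salary_mean = df["salary"].mean()
df["salary"] = df["salary"].fillna(salary_mean)

# Categorical column: fill with the most frequent value.
# mode() returns a Series because ties are possible; take the first entry.
dept_mode = df["dept"].mode()[0]
df["dept"] = df["dept"].fillna(dept_mode)

print(df)
```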

**Hot Deck Imputation**: Replace each missing value with a value drawn at random from other observations in the same sample that have similar values on the related variables.
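
One way to sketch hot deck imputation is to treat rows that share a grouping variable as "similar" and draw donors from within each group. The example below assumes a hypothetical `dept` column defines similarity; the helper `hot_deck` is illustrative, not a library function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the draws are reproducible

# Hypothetical data: salaries are missing for one row in each department.
df = pd.DataFrame({
    "dept": ["eng", "eng", "eng", "ops", "ops"],
    "salary": [70000.0, 72000.0, np.nan, 40000.0, np.nan],
})

def hot_deck(group: pd.Series) -> pd.Series:
    """Fill NaNs in one group with random draws from that group's observed values."""
    donors = group.dropna().to_numpy()
    if donors.size == 0:  # no donor available: leave the gaps as they are
        return group
    filled = group.copy()
    missing = filled.isna()
    filled[missing] = rng.choice(donors, size=int(missing.sum()))
    return filled

# Apply the donor draw separately inside each department.
df["salary"] = df.groupby("dept")["salary"].transform(hot_deck)
print(df)
```

Because donors are real observed values, the imputed entries always fall inside the range seen in their group, which mean imputation does not guarantee to look realistic.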

**Rescale Data**: When attributes have very different scales, rescaling puts them all on the same scale. scikit-learn provides the `MinMaxScaler` class for this; here every feature is mapped into the range 0 to 2.
```
from sklearn.preprocessing import MinMaxScaler

s_m = MinMaxScaler(feature_range=(0, 2))  # scale each feature into [0, 2]
rescaledX = s_m.fit_transform(X)
```
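
To make the arithmetic behind min-max scaling concrete, the following NumPy sketch applies the same linear mapping by hand. The function `min_max_scale` and the sample matrix are assumptions made for illustration; it is not how scikit-learn is implemented internally.

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Rescale each column of X linearly into feature_range."""
    X = np.asarray(X, dtype=float)
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Constant columns have zero span; use 1.0 there to avoid dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span * (hi - lo) + lo

# Two hypothetical features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

rescaled = min_max_scale(X, feature_range=(0, 2))
print(rescaled)
```

Each column's minimum maps to 0 and its maximum to 2, so both features end up on the same scale regardless of their original units.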

**Binarize Data**: Binarizing converts each value to 0 or 1 by comparing it with a threshold: values above the threshold become 1 and all others become 0. It is often used during feature engineering, and scikit-learn provides the `Binarizer` class for it.
```
from sklearn.preprocessing import Binarizer

b_n = Binarizer(threshold=1.0).fit(X)  # values > 1.0 become 1, all others 0
b_X = b_n.transform(X)
```
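
The thresholding rule itself is a single comparison. As a rough equivalent in plain NumPy, with a made-up sample matrix:

```python
import numpy as np

# Hypothetical feature matrix.
X = np.array([[0.5, 1.0, 2.5],
              [1.5, 0.0, 1.0]])
threshold = 1.0

# Values strictly greater than the threshold become 1, everything else 0.
binary = (X > threshold).astype(int)
print(binary)
```

Note that a value exactly equal to the threshold maps to 0.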

**Regression Imputation**: To preserve the relationships between features, fit a regression model that predicts the feature with missing data from the other features, then replace each missing value with the model's prediction.
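
A minimal sketch of regression imputation, assuming a single predictor `x` and a linear relationship. The data is synthetic and `np.polyfit` stands in for whatever regression model suits the real problem.

```python
import numpy as np

# Hypothetical data: y depends on x, and two y values are missing.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.0, np.nan, 8.0, np.nan, 12.0])

observed = ~np.isnan(y)

# Fit y = slope * x + intercept on the complete rows only.
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)

# Replace each missing y with the model's prediction for its x.
y_imputed = y.copy()
y_imputed[~observed] = slope * x[~observed] + intercept
print(y_imputed)
```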

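**Stochastic Regression Imputation**: Plain regression imputation places every imputed value exactly on the fitted line, which overstates the correlation between features. To better reproduce the correlation between features and labels, this technique adds random variation (noise) to each predicted value.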
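
The stochastic variant can be sketched by adding noise to each prediction. This example assumes a linear model and normally distributed noise whose spread is estimated from the residuals of the fitted rows; both are illustrative choices, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the noise is reproducible

# Hypothetical data: y is roughly 2 * x, with two missing values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.8, np.nan, 14.1])

observed = ~np.isnan(y)
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)

# Estimate the residual spread from the rows used for fitting.
residuals = y[observed] - (slope * x[observed] + intercept)
sigma = residuals.std(ddof=2)  # ddof=2 because two parameters were fitted

# Imputed value = prediction + random noise drawn from N(0, sigma^2).
predicted = slope * x[~observed] + intercept
noise = rng.normal(0.0, sigma, size=predicted.shape)
y_imputed = y.copy()
y_imputed[~observed] = predicted + noise
print(y_imputed)
```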

## Data Augmentation
Data augmentation increases the size and diversity of your training set by applying random transformations to the samples you already have.

<div style="text-align:center"><img alt="Data Augmentation" src="https://github.com/thunderstroke325/60-Days-of-Data-Science-and-ML/blob/main/assets/data_augmentation.png?raw=true" /></div>
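
To make the idea concrete without any deep learning framework, here is a NumPy-only sketch that produces several random variants of one image. The `augment` helper and the random stand-in image are assumptions for illustration, and the wrap-around shift is a simplification of how real pipelines usually fill the exposed border.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray, max_shift: int = 4) -> np.ndarray:
    """Return a randomly flipped and horizontally shifted copy of an H x W x C image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    shift = int(rng.integers(-max_shift, max_shift + 1))
    # Wrap-around shift along the width axis (simplification).
    return np.roll(out, shift, axis=1)

# Random pixels stand in for a real photo.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Eight augmented variants generated from the single source image.
batch = np.stack([augment(image) for _ in range(8)])
print(batch.shape)
```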

### Width and Height Shifts
```
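import matplotlib.pyplot as plt
import tensorflow as tf

# A list means each image is shifted by one randomly chosen value (in pixels).
generator = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=[-90,-40,0,40,90],
    height_shift_range=[-40,0,40]
)

x, y = next(generator.flow_from_directory('images', batch_size=1))
plt.imshow(x[0].astype('uint8'));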
```

### Brightness
```
generator = tf.keras.preprocessing.image.ImageDataGenerator(
    brightness_range=(0.8,4.2)
)

x, y = next(generator.flow_from_directory('images', batch_size=1))
plt.imshow(x[0].astype('uint8'));
```

### Shear Transformation
```
generator = tf.keras.preprocessing.image.ImageDataGenerator(
    shear_range=46  # shear angle in degrees
)

x, y = next(generator.flow_from_directory('images', batch_size=1))
plt.imshow(x[0].astype('uint8'));
```

### Zoom
```
generator = tf.keras.preprocessing.image.ImageDataGenerator(
    zoom_range=[0.2,3.0]  # [lower, upper] bounds of the random zoom factor
)

x, y = next(generator.flow_from_directory('images', batch_size=1))
plt.imshow(x[0].astype('uint8'));
```

### Channel Shift
```
generator = tf.keras.preprocessing.image.ImageDataGenerator(
    channel_shift_range=180  # maximum random offset added to the channel values
)

x, y = next(generator.flow_from_directory('images', batch_size=1))
plt.imshow(x[0].astype('uint8'));
```

### Flips
```
generator = tf.keras.preprocessing.image.ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True
)

x, y = next(generator.flow_from_directory('images', batch_size=1))
plt.imshow(x[0].astype('uint8'));
```