### Module 2: Data Wrangling

This module covers the data pre-processing techniques needed before analysis, often referred to as data cleaning or data wrangling. Key topics include:

1. Identifying and Handling Missing Values:
   - Missing values occur when data entries are left empty.
   - Methods to identify and manage these missing values will be discussed.

2. Standardizing Data Formats:
   - Data from different sources may be in various formats, units, or conventions.
   - Methods in Python Pandas to standardize these values will be introduced.

3. Data Normalization:
   - Different numerical data columns may have varying ranges, making direct comparison difficult.
   - Normalization techniques, such as centering and scaling, bring data into a similar range for better comparison.

4. Data Binning:
   - Binning creates larger categories from numerical values.
   - This technique is useful for comparing groups of data.

5. Converting Categorical Variables:
   - Categorical values need to be converted into numeric variables for statistical modeling.
   - Methods to perform this conversion will be demonstrated.

Together, these techniques turn raw data into a dataset that is clean, standardized, and ready for meaningful comparison and statistical modeling.


### Managing Missing Data in Python

In this video, we introduce the pervasive problem of missing values and strategies for handling them when they appear in your data.

#### Understanding Missing Values
- Missing values occur when data entries are left empty or represented by symbols like question marks, N/A, zero, or blank cells.
- Dealing with missing data is crucial for accurate analysis and modeling.
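
Before missing values can be handled, they have to be recognized as missing. The following is a minimal sketch, assuming a small made-up dataframe in which `?` marks a missing entry; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "?" stands in for a missing entry, as in many raw files.
df = pd.DataFrame({
    "normalized_losses": ["164", "?", "158", "?"],
    "make": ["audi", "bmw", "audi", "volvo"],
})

# Convert the placeholder symbol to NaN so Pandas treats it as missing.
df = df.replace("?", np.nan)

# Count the missing entries in each column.
missing_counts = df.isnull().sum()
print(missing_counts)
```

Once placeholders are converted to `NaN`, methods such as `dropna()` and `replace()` can act on them.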

#### Strategies for Handling Missing Values
1. **Data Retrieval:** 
   - Contact the data collector to retrieve the actual missing values if possible.

2. **Data Removal:**
   - Drop the data entry or variable containing the missing value.
   - Use the `dropna()` method in Pandas to drop rows or columns with missing values (`NaN`).

3. **Data Replacement:**
   - Replace missing values with estimated values.
   - Common techniques include replacing missing values with:
     - Mean, median, or mode of the variable.
     - Specific values based on domain knowledge or additional information.

4. **Leave as Missing:**
   - Sometimes, it's appropriate to leave missing values as is, especially if removing or replacing them may introduce bias or distort the data.

#### Handling Missing Data in Python
- **Removing Missing Values:**
  - Use the `dropna()` method in Pandas to drop rows or columns with missing values.
  - Specify `axis=0` to drop rows or `axis=1` to drop columns.
  - Example: `dataframe.dropna(axis=0, inplace=True)` to drop rows with missing values.

- **Replacing Missing Values:**
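  - Use the `replace()` method in Pandas to replace missing values with specific values.
  - Example: `dataframe['normalized_losses'] = dataframe['normalized_losses'].replace(np.nan, mean_value)` replaces missing values in a specific column. Assigning the result back is more reliable than calling `replace(..., inplace=True)` on a selected column, which may not update the original dataframe in recent Pandas versions.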
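
The two approaches above can be sketched side by side. This is an illustration on made-up data, not the course's exact code; the `subset` argument to `dropna()` is an addition here so that only rows missing a price are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column.
df = pd.DataFrame({
    "normalized_losses": [164.0, np.nan, 158.0],
    "price": [13950.0, 16500.0, np.nan],
})

# Option 1: drop every row in which "price" is missing.
dropped = df.dropna(subset=["price"], axis=0)

# Option 2: replace missing "normalized_losses" with the column mean.
# mean() skips NaN values by default.
mean_value = df["normalized_losses"].mean()
filled = df.copy()
filled["normalized_losses"] = filled["normalized_losses"].replace(np.nan, mean_value)

print(dropped)
print(filled)
```

Neither call modifies `df` itself, which keeps the original data available for comparison.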

#### Conclusion
So we've gone through two ways in Python to deal with missing data. We learned to drop problematic rows or columns containing missing values, and then we learned how to replace missing values with other values. But don't forget the other ways to deal with missing data. You can always check for a higher quality data set or source, or in some cases, you may want to leave the missing data as missing data.


### Handling Data Formats in Python

In this video, we'll look at the problem of data with different formats, units, and conventions and the Pandas methods that help us deal with these issues.

#### Understanding Data Formatting
- Data collected from various sources may be stored in different formats, units, or conventions.
- Data formatting brings data into a common standard of expression, ensuring consistency and facilitating meaningful comparisons.

#### Example: Standardizing Fuel Consumption Units
- Consider a dataset with a feature named "city-miles per gallon" representing car fuel consumption in miles per gallon.
- To convert miles per gallon to liters per 100 kilometers (the metric equivalent), divide 235 by each value in the "city-miles per gallon" column.
- In Python, this transformation can be done in one line of code: `df['city-miles per gallon'] = 235 / df['city-miles per gallon']`.
- Additionally, rename the column to "city-liters per 100 kilometers" using the `rename()` method:
  ```python
  df.rename(columns={'city-miles per gallon': 'city-liters per 100 kilometers'}, inplace=True)
  ```
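
The constant 235 is not explained in the notes. The sketch below shows one way to derive it, assuming the dataset uses US gallons (an assumption; the source does not say which gallon is meant). The function name is hypothetical.

```python
KM_PER_MILE = 1.609344              # exact, by definition of the international mile
LITERS_PER_US_GALLON = 3.785411784  # exact, by definition of the US liquid gallon

def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to liters per 100 kilometers."""
    km_per_liter = mpg * KM_PER_MILE / LITERS_PER_US_GALLON
    return 100 / km_per_liter

# The conversion factor is the result for 1 mpg.
constant = mpg_to_l_per_100km(1)
print(round(constant, 3))  # about 235.215, rounded to 235 in the notes
```

Because the relationship is a reciprocal, higher mpg values map to lower L/100km values.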

#### Data Type Correction
- Sometimes, the data type may be incorrectly established during dataset import.
- For example, the "price" feature may be assigned the data type "object" when it should be an integer or float.
- It's crucial to explore the data types of features and convert them to the correct data types for accurate analysis.
- Use the `dtypes` attribute to inspect each feature's data type:
  ```python
  print(df.dtypes)
  ```
- To convert data types, use the `astype()` method. For example, to convert the "price" column from object to integer:
  ```python
  df['price'] = df['price'].astype("int")
  ```
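
To see the conversion end to end, here is a small sketch on made-up data in which prices were read in as text. The values are illustrative only.

```python
import pandas as pd

# Hypothetical import result: numeric prices stored as strings.
df = pd.DataFrame({"price": ["13495", "16500", "13950"]})
print(df.dtypes)  # "price" is reported as a text column, not a number

# Convert the column to integers so arithmetic behaves as expected.
df["price"] = df["price"].astype("int")
print(df.dtypes)

total = df["price"].sum()
print(total)
```

If the column still contains placeholders such as `?`, the conversion raises an error, so missing values need to be handled first.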

#### Conclusion
Data formatting is an essential step in data cleaning to ensure consistency and facilitate meaningful analysis. By standardizing data formats and correcting data types, we can prepare the dataset for further analysis and modeling.

## Summary

In this video, we explore the challenges of dealing with data in different formats, units, and conventions, and the Pandas methods that help address these issues. Data collected from various sources can often have inconsistent formats, making it necessary to standardize the data for meaningful analysis.

### Key Points

1. **Data Formatting**: 
    - Data formatting involves standardizing the expression of data to allow for meaningful comparisons.
    - Example: Representations of "New York City" such as "N.Y.", "Ny", "NY", and "New York" need to be standardized unless analyzing variations is the goal.

2. **Unit Conversion**:
    - Converting units ensures consistency, particularly when data is used internationally.
    - Example: Converting fuel consumption from miles per gallon (mpg) to liters per 100 kilometers (L/100km) using the formula: 
      \[
      \text{city\_liters per 100 kilometers} = \frac{235}{\text{city miles per gallon}}
      \]
    - In Python:
      ```python
      # Convert the values in place, then rename the column to match the new unit
      df['city-miles per gallon'] = 235 / df['city-miles per gallon']
      df.rename(columns={'city-miles per gallon': 'city-liters per 100 kilometers'}, inplace=True)
      ```

3. **Correcting Data Types**:
    - Incorrect data types can lead to errors in analysis and model development.
    - Example: A price column might be incorrectly set as an object instead of a numerical type (integer or float).
    - To identify data types, use:
      ```python
      df.dtypes
      ```
    - To convert data types, use:
      ```python
      df['price'] = df['price'].astype(int)
      ```

### Practical Examples

- **Standardizing Text Data**:
  ```python
  # Map each variant to one canonical spelling (matches whole cell values only)
  df['city'] = df['city'].replace({'N.Y.': 'New York', 'Ny': 'New York', 'NY': 'New York'})
  ```

- **Unit Conversion Example**:
  ```python
  # Convert the values in place, then rename the column to match the new unit
  df['city-miles per gallon'] = 235 / df['city-miles per gallon']
  df.rename(columns={'city-miles per gallon': 'city-liters per 100 kilometers'}, inplace=True)
  ```

- **Converting Data Types**:
  ```python
  df['price'] = df['price'].astype(float)
  ```

### Conclusion

Standardizing data formats and units and correcting data types are essential steps in data cleaning to ensure accurate and meaningful analysis. Pandas provides powerful methods like `replace`, `rename`, and `astype` to facilitate these tasks.

# Data Normalization in Machine Learning

This section gives an overview of data normalization, a fundamental technique in machine learning for preparing data for analysis and modeling.

**Understanding Normalization**

Data normalization refers to the process of transforming features within a dataset to have a consistent value range. This is particularly important when dealing with datasets containing features measured on vastly different scales. Normalization ensures that all features contribute equally to statistical analyses and machine learning models, preventing features with larger scales from dominating the results.

**Importance of Normalization**

There are two primary reasons why normalization is crucial in machine learning:

1. **Enhanced Statistical Analysis:** Many statistical methods employed in machine learning rely on the assumption that features have comparable scales. Normalization ensures features contribute equally to analyses, leading to more reliable and interpretable results.

2. **Fairer Model Building:** When features have significantly different scales, machine learning algorithms like linear regression can become biased towards features with larger values. Normalization mitigates this bias by placing all features on a similar scale, fostering fairer model behavior and more accurate predictions.

**Illustrative Example: Age vs. Income**

Consider a dataset with two features: age (ranging from 0 to 100) and income (ranging from 0 to potentially millions). Without normalization, income's much larger range would overwhelm age in analyses like linear regression. Normalizing both features to a range such as 0 to 1 lets them contribute on a comparable scale, so income no longer dominates simply because its values are larger.

**Common Normalization Techniques**

Several effective normalization techniques are used in machine learning. Here, we explore three prominent methods:

1. **Simple Feature Scaling:** This method divides each value by the maximum value of that feature. For non-negative data, the resulting values range from 0 to 1.

2. **Min-Max Scaling:** This technique transforms each feature value by subtracting the minimum value in the feature and then dividing by the range (difference between maximum and minimum). Similar to simple scaling, the resulting values fall between 0 and 1. 

3. **Z-score Normalization (Standardization):** This method calculates the mean (average) and standard deviation of each feature. Each value is then transformed by subtracting the mean and dividing by the standard deviation. The resulting values typically cluster around 0, with a range of -3 to +3 (although they can fall outside this range).

**Implementation Considerations**

Pandas supports all three techniques through ordinary column arithmetic together with methods such as `max()`, `min()`, `mean()`, and `std()`. The choice of technique depends on the data and the modeling requirements.
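
As a minimal sketch of the three techniques, assuming a hypothetical `length` column with made-up values:

```python
import pandas as pd

# Hypothetical feature with three values.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
col = df["length"]

# Simple feature scaling: divide by the maximum.
simple = col / col.max()

# Min-max scaling: subtract the minimum, divide by the range.
min_max = (col - col.min()) / (col.max() - col.min())

# Z-score: subtract the mean, divide by the standard deviation.
# Pandas' std() uses the sample standard deviation (ddof=1) by default.
z_score = (col - col.mean()) / col.std()

print(simple.tolist())
print(min_max.tolist())
print(z_score.tolist())
```

Each result is a new Series; assigning it back to `df["length"]` would overwrite the original values.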

By effectively applying data normalization techniques, you can significantly improve the fairness, accuracy, and overall effectiveness of your machine learning models.


## Data Binning for Preprocessing and Analysis in Python

This section explains data binning, a technique for grouping numerical data into categories (bins) for better understanding and model building.

**What is Binning?**

Binning involves dividing a continuous numerical feature into a set of discrete intervals (bins). It's like grouping similar values together.

**Benefits of Binning**

* **Improved Model Accuracy:** Binning can sometimes improve the accuracy of predictive models by simplifying the data and reducing noise.
* **Data Understanding:** Binning helps visualize the distribution of data by creating categories. It can reveal patterns or trends in the data that might be hidden in raw numerical values.

**Example: Analyzing Car Prices**

Imagine a "price" feature in a car dataset, ranging from $5,000 to $45,500. Binning allows us to categorize these prices into groups like "low," "medium," and "high" for better comprehension.

**Binning Implementation in Python**

Here's how to implement binning with Python libraries:

1. **Define Bins:** Decide on the number of bins and their ranges. We'll create three bins (low, medium, high) of equal width.

2. **Calculate Bin Dividers:** Use NumPy's `linspace` function to generate equally spaced dividers between the minimum and maximum price values.

3. **Assign Bin Names:** Create a list of names for each bin (e.g., "low_price," "medium_price," "high_price").

4. **Binning Data:** Use Pandas' `cut` function to segment the "price" feature values into the corresponding bins based on the dividers.

5. **Visualize Results:** Employ histograms to visualize the distribution of data after binning. This can reveal patterns like how many cars fall into each price category.
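
The steps above can be sketched as follows, using made-up prices within the range mentioned earlier. The column name `price_binned` is an assumption, and `value_counts()` stands in for a histogram so the example needs no plotting library.

```python
import numpy as np
import pandas as pd

# Hypothetical prices spanning the example range.
df = pd.DataFrame({"price": [5000, 12000, 20000, 31000, 45500]})

# Four equally spaced dividers define three equal-width bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["low_price", "medium_price", "high_price"]

# include_lowest=True keeps the minimum price inside the first bin.
df["price_binned"] = pd.cut(
    df["price"], bins, labels=group_names, include_lowest=True
)

# Count how many cars fall into each price category.
counts = df["price_binned"].value_counts()
print(df)
print(counts)
```

Without `include_lowest=True`, the smallest price would fall outside the first interval and be left as `NaN`.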

**Summary**

Binning is a valuable technique for transforming continuous numerical data into discrete categories. It can lead to improved model performance and provide a clearer understanding of data distribution.


## One-Hot Encoding Categorical Variables in Python

This section summarizes how to convert categorical variables into quantitative variables in Python for machine learning models.

**Why One-Hot Encoding?**

Many statistical models require numerical inputs, while categorical variables contain strings or text labels. One-hot encoding addresses this by transforming categorical variables into numerical features suitable for modeling.

**Example: Fuel Type in Cars**

Consider a car dataset with a "fuel type" feature containing values like "gas" and "diesel" (categorical). To use this feature in a model, we need to convert it to a numerical format.

**One-Hot Encoding Technique**

One-hot encoding introduces new features (dummy variables) for each unique category in the original categorical variable. Each new feature represents a specific category.

* A value of 1 indicates membership in that category.
* A value of 0 indicates non-membership.

**Example: Encoding "Fuel Type"**

The original "fuel type" feature has two categories: "gas" and "diesel." We create two new features: "gas" and "diesel."

* For cars with "gas," the "gas" feature is set to 1, and "diesel" is set to 0.
* For cars with "diesel," the "diesel" feature is set to 1, and "gas" is set to 0.

This effectively converts the categorical information into numerical representations for modeling purposes.

**One-Hot Encoding in Pandas**

Pandas provides the `get_dummies` function for conveniently converting categorical variables into dummy variables.

**Example Code (Using `get_dummies`):**

```python
import pandas as pd

# Sample data (replace with your actual data)
data = {'fuel_type': ['gas', 'diesel', 'gas']}
df = pd.DataFrame(data)

# One-hot encoding using get_dummies
dummy_variables_one = pd.get_dummies(df['fuel_type'])

# Resulting data frame with dummy variables
print(dummy_variables_one)
```
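
In practice the dummy variables usually need to be joined back to the original dataframe. The sketch below shows one possible way, assuming a made-up `price` column alongside `fuel_type`. Passing `dtype=int` is a choice made here to get 0/1 values, since recent Pandas versions may return `True`/`False` columns by default.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df = pd.DataFrame({
    "fuel_type": ["gas", "diesel", "gas"],
    "price": [13495, 16500, 13950],
})

# Create 0/1 dummy columns, one per category.
dummies = pd.get_dummies(df["fuel_type"], dtype=int)

# Append the dummy columns and drop the original text column.
df = pd.concat([df, dummies], axis=1).drop(columns="fuel_type")
print(df)
```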

By using one-hot encoding, you can prepare your categorical data for use in various machine learning models.


## Data Preprocessing Techniques

This summary highlights the key data preprocessing techniques covered in this lesson, equipping you to effectively prepare data for machine learning models.

**Data Formatting**

* Standardize data formats (e.g., capitalization, abbreviations) for consistent analysis.
* Ensure data from different sources are comparable by addressing inconsistencies.

**Unit Conversion**

* Convert units of measurement (e.g., miles per gallon to liters per 100 kilometers) for better analysis and interpretation.
* Utilize Python libraries like Pandas to perform unit conversions for numerical features.

**Data Type Handling**

* Identify and correct data types using Python methods (e.g., Pandas `dtypes`) to ensure accurate analysis.
* Convert data to appropriate numerical types (e.g., integer, float) for proper calculations in statistical models.

**Data Normalization**

* Normalize data to make features comparable and mitigate biases in statistical models.
* Apply techniques like Feature Scaling, Min-Max Scaling, and Z-Score Normalization in Python using Pandas methods.

**Data Binning**

* Utilize binning to group numerical data into categories (bins) for improved model accuracy and data understanding.
* Implement binning in Python with NumPy's `linspace` function for creating bins and Pandas' `cut` function for assigning data points to bins.
* Leverage histograms to visualize the distribution of binned data to gain insights into feature behavior.

**One-Hot Encoding**

* Convert categorical variables (e.g., "fuel type") into numerical representations suitable for machine learning models.
* Employ one-hot encoding to create dummy variables in Python using Pandas' `get_dummies` method. Each dummy variable represents a category within the original feature.

By mastering these techniques, you can effectively prepare your data for machine learning, leading to more robust and accurate models.

# Practice Quiz: Data Wrangling

**Assignment details**
- **Submitted:** June 10, 12:43 AM +07
- **Attempts:** Unlimited
- **Your grade:** To pass you need at least 75%. We keep your highest score.
  **Grade:** 100%

**Practice Assignment • 12 min**

### 1. Question 1
What is the correct syntax to access a column, say "symboling", from a dataframe, say df?

1 point

- [ ] `df.get("symboling")`
- [ ] `df=="symboling"`
- [ ] `df="symboling"`
- [x] `df["symboling"]`

### 2. Question 2
How would you change the name of the column "city_mpg" to "city-L/100km"?

1 point

- [ ] `df.rename(columns={"city_mpg": "city-L/100km"})`
- [x] `df.rename(columns={"city_mpg": "city-L/100km"}, inplace=True)`
- [ ] `df.columnheader(columns={"city_mpg": "city-L/100km"}, inplace=True)`
- [ ] `df.columnname={"city_mpg": "city-L/100km"})`

### 3. Question 3
What is the primary purpose of normalization?

1 point

- [x] To make the range of the values consistent and make comparing and analyzing values easier
- [ ] So all the variables have a similar influence on the models you build
- [ ] To get rid of “not a number” or NaN values
- [ ] It brings data into a common standard of expression

### 4. Question 4
Why do we convert categorical variables into numerical values?

1 point

- [ ] To save memory
- [x] Most statistical models require numerical values
- [ ] It makes it easier to visualize the data
- [ ] It makes it easier to fill in missing data



# Practice Quiz: Data Wrangling

## Question 1
Which of the following methods should you use to replace a missing value of an attribute with continuous values?

1 point

- [x] Use the average of the other values in the column
- [ ] Use the difference between the minimum and maximum values of the other data in the column
- [ ] Use the mean square error of the other data in the column
- [ ] Use an educated guess

## Question 2
Which of the following helps you decide on bin values when pre-processing data?

1 point

- [x] Visualize the distribution using a histogram
- [ ] Divide the average by the standard deviation
- [ ] Convert objects to ints
- [ ] Use the interquartile range

## Question 3
Which of the following data types should numbers with decimals be if you want to use them as input for training a statistical model?

666, 1.1, 232, 23.12

1 point

- [ ] int
- [ ] data frame
- [x] float
- [ ] object

## Question 4
Which of the following is the primary purpose of simple feature scaling?

1 point

- [ ] It brings data into a common standard of expression
- [x] To make comparing and analyzing values easier.
- [ ] To get rid of “not a number” or NaN values
- [ ] So all the variables have a similar influence on the models you build

## Question 5
Which of the following is the primary purpose of the get_dummies() method?

1 point

- [ ] Converts the data’s data type
- [ ] To help you group your data into bins
- [x] Converts categorical values into numerical ones
- [ ] Converts numerical values into categorical ones