## **Chapter 2:** Data Manipulation

For the next section, we use a dataset from an injection molding experiment. Here it serves only to demonstrate another way of obtaining a DataFrame. The experiments behind the data are described in the following research paper:

> Polenta, A.; Tomassini, S.; Falcionelli, N.; Contardo, P.; Dragoni, A.F.; Sernani, P. A Comparison of Machine Learning Techniques for the Quality Classification of Molded Products. Information 2022, 13, 272. https://doi.org/10.3390/info13060272

Please download the `csv` data [here](https://github.com/airtlab/machine-learning-for-quality-prediction-in-plastic-injection-molding/blob/master/dataset/data.csv) and save it in your `/data` folder.



### **[2.1 Data Importing](#2.1-data-importing)**
<a name="2.1-data-importing"></a>
Pandas supports a wide range of data formats for reading and writing data, making it incredibly versatile for data analysis tasks. 

You can easily import data from `CSV`, `Excel`, `JSON`, `SQL` databases, and many other formats into Pandas DataFrames. 

Here's how to import data from a CSV file:

In [None]:
# Import data with the pandas read_csv function
import pandas as pd

# The file is semicolon-separated, so the delimiter is passed explicitly
df = pd.read_csv('../data/data.csv', delimiter=";")

# In the next sections (2.2, 2.3 and 2.4), we are going to use simpler example data.
# For the coding challenge (2.5), we will use the injection molding dataset.

The import works the same way for every format, using the matching `pd.read_<format>` function, e.g.:
- pd.read_csv
- pd.read_json
- pd.read_excel
- pd.read_html
- pd.read_xml
- pd.read_sql

For more information about data import, see the official pandas documentation ([click here](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)).
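
To make the `pd.read_<format>` pattern concrete without needing a file on disk, the following minimal sketch reads semicolon-separated text from memory. The column names and values in `csv_text` are made up for illustration; only `pd.read_csv` and its `sep` parameter are the point here.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated text standing in for a file on disk
csv_text = "melt_temperature;cycle_time;quality\n106.5;74.83;1\n105.5;74.81;2\n"

# read_csv accepts file paths as well as file-like objects such as StringIO
df_example = pd.read_csv(io.StringIO(csv_text), sep=";")

print(df_example.shape)
print(df_example.dtypes)
```

If the separator is left at its default (a comma), the same text would be read into a single column, which is a common sign that `sep` needs to be set.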

### **[2.2 Data Exploration](#2.2-data-exploration)**
<a name="2.2-data-exploration"></a>
Once you have your data loaded into a DataFrame, Pandas offers several functions to perform initial data exploration. 

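These methods and attributes help you understand the structure and content of your data. Note that `head` and `describe` are methods (called with parentheses), while `dtypes`, `index` and `columns` are attributes (accessed without parentheses):
- name_of_the_dataframe **.head()**
- name_of_the_dataframe **.describe()**
- name_of_the_dataframe **.dtypes**
- name_of_the_dataframe **.index**
- name_of_the_dataframe **.columns**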

In [None]:
# Creating a new DataFrame to explore 
import numpy as np

# Create some dummy data using NumPy
dummy_data = np.random.randn(20,4)

# Create a dataframe with dummy data 
df = pd.DataFrame(data=dummy_data)

df 

> Why do we see something, even though we did not print the dataframe?

When you simply place the name of a DataFrame (or any object for that matter) at the end of a Jupyter notebook cell, Jupyter uses the `_repr_html_()` method of the DataFrame object to render it as an `HTML table`. This method provides a visually appealing, styled HTML representation of the DataFrame, which includes features like a boxed layout, background shading for the header, and potentially interactive elements like scrollbars for wide or long DataFrames.

On the other hand, when you use the `print(df)` statement, the DataFrame is converted to a string representation using the `__str__()` method. This results in a text-based output that looks more like traditional, console-based output. This output is less visually appealing than the HTML representation and lacks the formatting enhancements like background shading and borders.

The HTML representation is generally more readable, especially for large DataFrames, as it makes better use of the notebook's capabilities to present data in an interactive environment. However, if you need to share your DataFrame in a text format or prefer the simplicity of the text output for any reason, you might use `print(df)` instead.
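
The difference between the two representations can be inspected directly. This is a small sketch with a made-up two-column DataFrame; it calls `_repr_html_()` by hand only for demonstration, which is normally something Jupyter does for you.

```python
import pandas as pd

df_small = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# What Jupyter renders when the DataFrame is the last expression in a cell
html_output = df_small._repr_html_()

# What print(df_small) shows
text_output = str(df_small)

print(html_output[:60])
print(text_output)
```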

In [None]:
# To make the data more interesting, we are going to add three Series to df

# Create three new Series with integer, categorical (string) and boolean values
dummy_series_int = pd.Series(np.random.choice([0,1,2], size=20))
dummy_series_cat = pd.Series(np.random.choice(["a","b","c"], size=20))
dummy_series_bool = pd.Series(np.random.choice([True, False], size=20))

# Add the Series to the DataFrame
df = df.assign(int_col=dummy_series_int, cat_col=dummy_series_cat, bool_col=dummy_series_bool)

df

The `.assign` method in pandas is a powerful tool for adding new columns to a `DataFrame` in a functional style, meaning it returns a *new* DataFrame without modifying the original one. This method is particularly useful for creating new columns that are derived from existing ones or for adding entirely new information to a DataFrame. You will see more ways to add columns to DataFrames in the next chapters.
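
A short sketch of the "returns a new DataFrame" behaviour, using hypothetical column names (`width_mm`, `height_mm`, `area_mm2`). Passing a callable lets the new column be derived from existing ones.

```python
import pandas as pd

df_original = pd.DataFrame({"width_mm": [10.0, 20.0], "height_mm": [2.0, 3.0]})

# assign returns a new DataFrame; the callable receives the DataFrame being extended
df_extended = df_original.assign(area_mm2=lambda d: d["width_mm"] * d["height_mm"])

print(list(df_original.columns))  # the original is unchanged
print(list(df_extended.columns))  # the result carries the derived column
```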

**.head():** Display the first 5 rows of the DataFrame. You can change the number of rows from 5 to *n* with `.head(n)`. With `.head(-n)` every row is displayed except the last *n* rows.

In [None]:
# Display the first 5 rows of the DataFrame
df.head()  # Try to change the number of rows
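
The positive and negative forms of `n` can be compared on a small, made-up DataFrame with ten rows. The `tail` method is not covered in this chapter; it is included here only as the natural counterpart of `head`.

```python
import pandas as pd

df_rows = pd.DataFrame({"n": range(10)})

first_three = df_rows.head(3)          # rows 0, 1, 2
all_but_last_three = df_rows.head(-3)  # rows 0 to 6
last_three = df_rows.tail(3)           # rows 7, 8, 9

print(len(first_three), len(all_but_last_three), len(last_three))
```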

**.describe()**: Display the summary statistics of the DataFrame's numeric columns. With the parameter `include='all'` all columns of the input will be included in the output, also non-numeric columns.

In [None]:
# Display the summary statistics of the DataFrame's numeric columns
df.describe()  # Try with parameter include='all'
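
As a rough illustration of what `include='all'` changes, the sketch below uses a made-up DataFrame with one numeric and one text column. For non-numeric columns, pandas reports `count`, `unique`, `top` and `freq` instead of the numeric statistics.

```python
import pandas as pd

df_mixed = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0], "label": ["a", "b", "a", "a"]}
)

numeric_summary = df_mixed.describe()            # numeric columns only
full_summary = df_mixed.describe(include="all")  # numeric and non-numeric columns

print(numeric_summary)
print(full_summary)
```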

**.dtypes:** Display the data types of each column. `dtypes` is an attribute, so it is accessed without parentheses. Columns with mixed types are stored with the 'object' dtype.

In [None]:
# Display the data types of each column
df.dtypes

In pandas, the data types (dtypes) of columns specify the kind of data each column contains. Understanding these dtypes is crucial for efficient data manipulation and analysis.
* The `float64` dtype represents floating-point numbers in pandas, stored using 64 bits of memory. A floating-point number is one that can contain fractions (decimal points), used for more precise calculations or when dealing with real numbers. The `64` in `float64` indicates the precision level of the floating-point data, with 64 bits offering double precision. This is the standard floating-point dtype in pandas and is capable of representing a wide range of decimal values. It's particularly useful for numerical analysis, scientific computations, and any situation where fractional numbers are common.

* The `int32` dtype in pandas represents integer numbers, using 32 bits of memory. Unlike floating-point numbers (`float64`), integers are whole numbers without a decimal point. The `32` in `int32` indicates that each integer value is stored using 32 bits. This allows for a range of values from `-2,147,483,648` to `2,147,483,647`. Depending on your platform and NumPy version, the integer column above may be shown as `int64` instead. The choice of `int32` over, say, `int64`, may be motivated by memory considerations, as `int32` uses half the memory of `int64`. However, it's important to ensure that the data you're working with fits within the `int32` range to avoid overflow errors. In pandas, integer columns are very common for indexing, counts, and any other numerical data that doesn't require a fractional component.

* The `object` dtype in pandas is used for columns that contain data of mixed types or non-numeric data. Most commonly, object dtype columns store text data (strings), but they can also hold complex objects like lists, dictionaries, or even other DataFrames if you have a particularly complex dataset. An `object` dtype is essentially a catch-all for columns that do not fit into other specific dtypes like `int64`, `float64`, or `bool`. Because object dtypes can store various types of data, operations on these columns might be less efficient, and performing numerical operations directly on an object column without first converting it to a more specific type can lead to errors or unexpected results.

* The `bool` dtype in pandas is used for columns that contain Boolean values: `True` or `False`. It is the pandas equivalent of Python's built-in `bool` type and represents a binary state. Boolean columns are extremely useful in data filtering operations, where they can be used directly as masks to select subsets of data that meet certain conditions. For example, a bool column might be used to flag certain rows based on whether they meet a specific criterion, such as whether a value in another column exceeds a certain threshold. In memory, each `bool` value is stored in a single byte, so boolean columns are compact compared with `int64` or `float64` columns.

When working with pandas, managing your DataFrame's dtypes is essential for both performance and accuracy. Converting columns to the most appropriate dtypes can significantly speed up data processing and ensure that operations like aggregations, sorting, and mathematical computations are performed correctly.
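
The memory effect of a dtype choice can be measured with `Series.memory_usage`. The sketch below uses 1000 made-up values and excludes the index from the measurement so that only the stored values are counted.

```python
import numpy as np
import pandas as pd

values_int64 = pd.Series(np.arange(1000), dtype="int64")
values_int32 = values_int64.astype("int32")
flags = pd.Series([True, False] * 500)

# Bytes used by the values themselves (index excluded)
bytes_int64 = values_int64.memory_usage(index=False)
bytes_int32 = values_int32.memory_usage(index=False)
bytes_bool = flags.memory_usage(index=False)

print(bytes_int64, bytes_int32, bytes_bool)
```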

**.index**: Display the index of the DataFrame.

In [None]:
#  Display the index of the DataFrame
df.index

* RangeIndex: This is a type of index that pandas uses. It's an optimized form of index for cases where the index is equivalent to `range(start, stop, step)`. It's particularly memory-efficient because, instead of storing all index values, it only needs to store the start, stop, and step values.
    * `start=0`: This is the starting value of the index. It indicates that the index starts at 0.
    * `stop=20`: This is the stopping value of the index, but it's exclusive. That means the actual last index value is one less than this number. So, for our DataFrame with 20 rows, the last index is 19 (since indexing is zero-based).
    * `step=1`: This indicates the increment between each index value. A step of 1 means the index values are sequential integers.
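
The three values can also be read directly from the index object. This sketch builds a placeholder DataFrame with 20 rows of zeros, matching the row count of the dummy data used above.

```python
import numpy as np
import pandas as pd

df_range = pd.DataFrame(np.zeros((20, 4)))
idx = df_range.index

print(idx)                            # the default index is a RangeIndex
print(idx.start, idx.stop, idx.step)  # the three stored values
print(idx[-1])                        # last label is stop - 1
```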

In [None]:
# Pandas allows you to create dataframe with any index you like
df_idx = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], index=["A", "B", "B"], columns=["X", "Y", "Z"])
df_idx
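
Note that index labels do not have to be unique, as the repeated `"B"` above shows. The sketch below rebuilds the same data under a different variable name (`df_labels`, chosen here for illustration) and uses `loc`, which is introduced in section 2.3, to show the consequence: a repeated label returns several rows.

```python
import pandas as pd

df_labels = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["A", "B", "B"],
    columns=["X", "Y", "Z"],
)

row_a = df_labels.loc["A"]   # unique label: a single row, returned as a Series
rows_b = df_labels.loc["B"]  # repeated label: both rows, returned as a DataFrame

print(df_labels.index.is_unique)
print(rows_b)
```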

**.columns**: Display the names of the columns of the DataFrame.

In [None]:
# Display the column names of the DataFrame
df.columns

### **[2.3 Data Indexing](#2.3-data-indexing)**
<a name="2.3-data-indexing"></a>
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as *Subset Selection*.

Data indexing is important because it:
- Identifies data (i.e. provides metadata) using known indicators, which is important for analysis, visualization, and interactive console display.
- Enables automatic and explicit data alignment.
- Allows intuitive getting and setting of subsets of the data set. 
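
Automatic alignment is easiest to see with two Series whose labels appear in a different order. The sketch below reuses population figures from this chapter's example data; the variable names are chosen for illustration.

```python
import pandas as pd

population_2019 = pd.Series({"Dortmund": 587010, "Kassel": 201585})
population_2023 = pd.Series({"Kassel": 191854, "Dortmund": 591672})

# Values are matched by index label, not by position
change = population_2023 - population_2019

print(change)
```

Without labels, a position-based subtraction would have compared Kassel with Dortmund.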

In [None]:
# First, we create a new example dataframe from a dictionary to work with 

data = {
    "city": ["Dortmund", "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, 191854, 539050, 204202, 591672],
    "year": [2019, 2019, 2023, 2021, 2022, 2023]
}

df = pd.DataFrame(data)

df

#### Simple indexing

* **Select columns:** With the name of one selected column or more selected columns as a list.

In [None]:
# Select a column via its name

df["city"]

In [None]:
# Select multiple columns by providing a list 
df[["city", "population"]]

#### Simple slicing

* **Select rows:** With slicing operations similar to NumPy. 
    * Use `[:n]` to get the first n rows. 
    * Use `[n:]` to get all rows except the first n. 
    * Use `[::n]` to get every n-th row.

In [None]:
# Get the first three rows
df[:3]

In [None]:
# Get the last three rows
df[-3:]

In [None]:
# Get every second row
df[::2]

#### Label-based selection with `loc`

The `loc` indexer selects rows and columns by their labels. It supports both picking specific labels and slicing ranges of rows or columns. Unlike ordinary Python slicing, a label slice includes both endpoints.

In [None]:
# Select a single row by index label
df.loc[0]

In [None]:
# Select multiple rows by index labels
df.loc[[3, 4]]

In [None]:
# Select rows by index label range (both endpoints are included: rows 2, 3, 4 and 5)
df.loc[2:5]

In [None]:
# Select specific rows and columns
df.loc[[2, 3], ['city', 'population']]


#### Positional Indexing with `iloc`

The `iloc` method is used for selecting rows and columns based on their integer positions. The positions are taken to be integer indices, so they start at `0`. It’s similar to Python *list slicing*, and as such, it’s exclusive of the endpoint.

In [None]:
# Select a single row by position
df.iloc[0]

In [None]:
# Select multiple rows by position
df.iloc[1:5]  # Selects rows 1 to 4 (position 5 is excluded)

In [None]:
# Select specific rows and columns by position
df.iloc[[0, 2, 3], [1, 2]]  # Rows 0, 2, 3 and columns 1, 2

In [None]:
# Using a boolean array
df.iloc[[True, False, True, False, True, True]]

**Summary:** In pandas, both `.loc[]` and `.iloc[]` are used for indexing and selecting data from DataFrames, but they work in fundamentally different ways due to their indexing criteria: labels for `.loc[]` and integer positions for `.iloc[]`. 

Understanding the difference between `.loc[]` and `.iloc[]` is crucial for correctly accessing data within pandas objects.

* Label-based Indexing: `.loc[]` is used to select rows and columns by their labels.
* Positional Indexing: `.iloc[]` is used to select rows and columns by their integer position (i.e., their physical location in the DataFrame).
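
With the default index `0, 1, 2, ...`, labels and positions coincide, which hides the difference. The sketch below therefore uses years as index labels (an assumption made for this illustration) so that the two indexers visibly diverge.

```python
import pandas as pd

df_years = pd.DataFrame(
    {"population": [587010, 539050, 591672]},
    index=[2019, 2021, 2023],
)

by_label = df_years.loc[2019:2021]  # label slice: both endpoints included
by_position = df_years.iloc[0:1]    # position slice: end excluded

value_by_label = df_years.loc[2021, "population"]
value_by_position = df_years.iloc[1, 0]

print(len(by_label), len(by_position))
print(value_by_label, value_by_position)
```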

### **[2.4 Data Cleaning](#2.4-data-cleaning)**
<a name="2.4-data-cleaning"></a>
Data cleaning is a crucial step before the actual analysis. 

Pandas provides several methods to deal with duplicate data, missing data and data transformation:

In [None]:
# First, we create a new example dataframe from a dictionary to work with 

data = {
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund", "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672, 191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019,2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023]
}

df = pd.DataFrame(data)

df

#### Remove duplicate values

Removing duplicates from a DataFrame is an essential step in data cleaning to ensure the accuracy of your data analysis. Pandas provides two main functions to handle duplicate data: `duplicated()` and `drop_duplicates()`. Understanding how to use these functions allows for effective identification and removal of duplicate rows in your dataset.

* The `duplicated()` function is used to identify duplicate rows in a DataFrame. By default, it returns a *boolean* Series indicating whether each row is a duplicate of a row encountered earlier in the DataFrame. You can specify how to consider duplicates with the `keep` parameter.


In [None]:
# Find duplicates
df_copy = df.copy()
df_duplicates = df.duplicated()

df_duplicates
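
The effect of the `keep` parameter can be seen on a three-row example in which the first and last rows are identical. The data below is a shortened, made-up variant of the example DataFrame.

```python
import pandas as pd

df_dup = pd.DataFrame(
    {"city": ["Dortmund", "Kassel", "Dortmund"], "year": [2023, 2022, 2023]}
)

# keep="first" (default): later occurrences are flagged
first_kept = df_dup.duplicated(keep="first").tolist()
# keep="last": earlier occurrences are flagged
last_kept = df_dup.duplicated(keep="last").tolist()
# keep=False: every member of a duplicate group is flagged
all_marked = df_dup.duplicated(keep=False).tolist()

print(first_kept, last_kept, all_marked)
```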

* The `drop_duplicates()` function removes duplicate rows from a DataFrame. Like `duplicated()`, it offers flexibility in defining what constitutes a duplicate.

In [None]:
# Removing duplicate entries
df_raw = df.copy()
df_copy = df.drop_duplicates()

df_copy
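
One way to change what counts as a duplicate is the `subset` parameter, which restricts the comparison to selected columns. The following sketch, using made-up rows, keeps only the last entry per city.

```python
import pandas as pd

df_cities = pd.DataFrame(
    {"city": ["Dortmund", "Kassel", "Dortmund"], "year": [2019, 2019, 2023]}
)

# Rows count as duplicates when "city" matches; the last occurrence is kept
latest_per_city = df_cities.drop_duplicates(subset="city", keep="last")

print(latest_per_city)
```

The original row labels are preserved in the result, so the index may contain gaps afterwards.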

#### Handle missing values

Handling missing values is a critical aspect of data cleaning that directly impacts the *quality of data analysis*. Missing data can arise for various reasons, such as errors in data collection, processing, or transmission. 

**Pandas** provides several methods to deal with missing values effectively: `isnull()`, `notnull()`, `dropna()`, and `fillna()`. Each of these methods plays a specific role in identifying, removing, or replacing missing values in a `DataFrame` or `Series`.

* `isnull()` helps to identify missing values in a DataFrame or Series. It returns a boolean object of the same size, indicating whether each element is `NaN` or `None`.

In [None]:
# Example use of isnull()
df.isnull()

* `notnull()` can be considered the opposite of `isnull()`, since it identifies non-missing values. It also returns a boolean object of the same size, but indicates whether each element is **not** `NaN` or `None`.

In [None]:
# Example use of notnull()
df.notnull()
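
Because the boolean result can be summed (`True` counts as 1), `isnull()` is often combined with `sum()` or `any()` to get an overview instead of a full table. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame(
    {
        "city": ["Dortmund", "Kassel", None],
        "population": [587010, np.nan, np.nan],
    }
)

# Number of missing values per column
missing_per_column = df_missing.isnull().sum()

# Number of rows that contain at least one missing value
rows_with_missing = df_missing.isnull().any(axis=1).sum()

print(missing_per_column)
print(rows_with_missing)
```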

* You can use `dropna()` to remove missing values from a `DataFrame` or a `Series`. 
    * The parameter `axis` specifies whether *rows* (`0` or `'index'`) or *columns* (`1` or `'columns'`) that contain missing values should be dropped.


In [None]:
# Dropping the rows
df_copy = df_copy.dropna(axis=0)

In [None]:
# Dropping the columns
df_copy = df.copy()
df_copy.dropna(axis=1)
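
Dropping can remove more data than intended, because by default a single missing value is enough to drop the whole row or column. The sketch below uses made-up data with a hypothetical, completely empty `area` column, and also shows the `how="all"` option, which drops only rows or columns that are entirely missing.

```python
import numpy as np
import pandas as pd

df_gaps = pd.DataFrame(
    {
        "city": ["Dortmund", "Kassel", "Kassel"],
        "population": [587010, np.nan, 204202],
        "area": [np.nan, np.nan, np.nan],
    }
)

rows_dropped = df_gaps.dropna(axis=0)                   # every row has a NaN in "area"
cols_dropped = df_gaps.dropna(axis=1)                   # only "city" is complete
empty_cols_dropped = df_gaps.dropna(axis=1, how="all")  # only "area" is fully empty

print(len(rows_dropped))
print(list(cols_dropped.columns))
print(list(empty_cols_dropped.columns))
```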

* You can use `fillna()` to replace missing values with a specified value. For imputation, first compute a statistic such as the `mean`, `median`, or `mode` of the column and then pass it to `fillna()`.

In [None]:
# Filling NaN with a string
df_copy = df.copy()
df_copy.fillna(value="Value to fill")

In [None]:
# Filling NaNs using an imputation method (here: the column mean)
df_copy = df.copy()
df_copy.fillna(value={"population": df_copy["population"].mean()})
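
A single overall mean mixes the populations of both cities, which may not be a sensible fill value. As one possible refinement (not the only one), the sketch below fills each gap with the mean of the same city, using `groupby` and `transform`, which are not covered in this chapter. The data is a shortened, made-up variant of the example above.

```python
import numpy as np
import pandas as pd

df_pop = pd.DataFrame(
    {
        "city": ["Dortmund", "Kassel", "Dortmund", "Kassel"],
        "population": [587010.0, 201585.0, np.nan, np.nan],
    }
)

# Mean population per city, repeated for every row of that city
city_mean = df_pop.groupby("city")["population"].transform("mean")

# fillna aligns the fill values with the rows by index label
df_filled = df_pop.assign(population=df_pop["population"].fillna(city_mean))

print(df_filled)
```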

#### Working with Data Types

Converting data types in a pandas DataFrame is a common task during the data cleaning process, ensuring that each column has the most appropriate type for data analysis or machine learning models. This process involves first inspecting the current data types of each column and then using the `astype()` method to explicitly convert columns to different types.

* Before converting data types, it's crucial to understand the current data types of the columns in your DataFrame. Pandas provides the `dtypes` attribute for this purpose.

In [None]:
# Inspecting Data Types with dtypes
df.dtypes

* Once you've identified the columns whose data types need to be converted, you can use the `astype()` method to perform the conversion. This method is versatile, allowing for the conversion to most pandas data types and even custom types.

In [None]:
# Converting Data Types with astype()
df_copy = df.copy()
df_copy = df_copy.fillna(42)
df_copy = df_copy.astype({'population': 'int32', 'year': 'category'})

df_copy.dtypes

In this example, the population column is converted to a `32-bit` integer to potentially reduce memory usage, and the `year` column is converted to a category type, which is often useful for columns with a relatively small number of unique values and can lead to performance improvements in certain operations.
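
The `fillna(42)` step in the cell above is needed because a NumPy integer dtype such as `int32` cannot represent missing values. The sketch below shows the failure and, as an alternative to inventing a fill value, pandas' nullable integer dtype `"Int32"` (capital `I`), which keeps the gap as a missing value.

```python
import numpy as np
import pandas as pd

population = pd.Series([587010.0, np.nan, 591672.0])

try:
    population.astype("int32")
    conversion_failed = False
except ValueError:
    # NaN has no representation in a NumPy integer dtype
    conversion_failed = True

# The nullable integer dtype keeps the missing entry instead of failing
population_nullable = population.astype("Int32")

print(conversion_failed)
print(population_nullable)
```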

### **[2.5 Coding Challenge](#2.5-coding-challenge)**
<a name="2.5-coding-challenge"></a>

For this coding challenge, you'll be working with a dataset of your choice. Your task is to import the data into a Pandas DataFrame, explore the data to understand its structure, clean the data as necessary, and perform some basic analysis. This exercise will test your ability to manipulate and analyze data using Pandas.

Challenge objectives:

1. Import a dataset into a Pandas DataFrame.
2. Use DataFrame methods to explore the data (e.g., `head()`, `describe()`, `dtypes`).
3. Clean the dataset by handling missing values and duplicates.

To give you an idea of how the task can be solved, the example below analyses a dataset on "Machine Learning for Quality Prediction in Plastic Injection Molding". This real-world dataset was collected during the production process and is used for a "systematic comparison of ML techniques to predict the quality classes of plastic molded products":

> Polenta, A.; Tomassini, S.; Falcionelli, N.; Contardo, P.; Dragoni, A.F.; Sernani, P. A Comparison of Machine Learning Techniques for the Quality Classification of Molded Products. Information 2022, 13, 272. https://doi.org/10.3390/info13060272


Again, you can download the dataset and get more information [here](https://github.com/airtlab/machine-learning-for-quality-prediction-in-plastic-injection-molding).


**Step 1: Import a dataset into a Pandas DataFrame.**
- Load the data using the pandas import function to begin the analysis.


In [None]:
# Step 1: Import a dataset into a Pandas DataFrame

# Sample code template
import pandas as pd
# Import the dataset
df = pd.read_csv('../data/data.csv', delimiter=";")

df.head()

| | Melt temperature | Mold temperature | time_to_fill | ZDx - Plasticizing time | ZUx - Cycle time | SKx - Closing force | SKs - Clamping force peak value | Ms - Torque peak value current cycle | Mm - Torque mean value current cycle | APSs - Specific back pressure peak value | APVs - Specific injection pressure peak value | CPn - Screw position at the end of hold pressure | SVo - Shot volume | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 106.476184 | 80.617 | 7.124 | 3.16 | 74.83 | 886.9 | 904.0 | 116.9 | 104.3 | 145.6 | 922.3 | 8.82 | 18.73 | 1.0 |
| 1 | 105.505 | 81.362 | 6.968 | 3.16 | 74.81 | 919.409791 | 935.9 | 113.9 | 104.9 | 145.6 | 930.5 | 8.59 | 18.73 | 1.0 |
| 2 | 105.505 | 80.411 | 6.864 | 4.08 | 74.81 | 908.6 | 902.344823 | 120.5 | 106.503496 | 147.0 | 933.1 | 8.8 | 18.98 | 1.0 |
| 3 | 106.474827 | 81.162 | 6.864 | 3.16 | 74.82 | 879.410871 | 902.033653 | 127.3 | 104.9 | 145.6 | 922.3 | 8.85 | 18.73 | 1.0 |
| 4 | 106.46614 | 81.471 | 6.864 | 3.22 | 74.83 | 885.64426 | 902.821269 | 120.5 | 106.7 | 145.6 | 917.5 | 8.8 | 18.75 | 1.0 |


**Step 2: Use DataFrame methods to explore the data**
- Apply `head()`, `describe()` and `dtypes`

In [None]:
# Step 2: Use DataFrame methods to explore the data

df.shape

(1451, 14)

In [None]:
df.head()

In [None]:
df.describe()

| | Melt temperature | Mold temperature | time_to_fill | ZDx - Plasticizing time | ZUx - Cycle time | SKx - Closing force | SKs - Clamping force peak value | Ms - Torque peak value current cycle | Mm - Torque mean value current cycle | APSs - Specific back pressure peak value | APVs - Specific injection pressure peak value | CPn - Screw position at the end of hold pressure | SVo - Shot volume | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 |
| mean | 106.89204 | 81.326023 | 7.459043 | 3.234173 | 75.218794 | 901.974834 | 919.351778 | 116.716747 | 104.163904 | 146.230048 | 900.972846 | 8.808863 | 18.756285 | 2.461751 |
| std | 5.615773 | 0.428813 | 1.688106 | 0.34323 | 0.432761 | 11.098192 | 10.780023 | 5.029085 | 4.802195 | 0.804894 | 25.519215 | 0.097238 | 0.095528 | 1.123611 |
| min | 81.747 | 78.409 | 6.084 | 2.78 | 74.78 | 876.7 | 894.8 | 94.2 | 76.5 | 144.8 | 780.5 | 8.33 | 18.51 | 1.0 |
| 25% | 105.9145 | 81.1235 | 6.292 | 3.0 | 74.82 | 893.6 | 914.4 | 114.2 | 103.55 | 145.6 | 886.65 | 8.77 | 18.71 | 1.0 |
| 50% | 106.089 | 81.327 | 6.968 | 3.1926 | 74.83 | 902.4 | 918.8 | 116.9 | 105.2 | 146.1 | 906.8 | 8.82 | 18.75 | 2.0 |
| 75% | 106.263 | 81.441 | 7.124 | 3.29 | 75.65 | 909.4 | 926.3 | 120.2 | 106.531415 | 146.7 | 918.9 | 8.85 | 18.79 | 4.0 |
| max | 155.032 | 82.159 | 11.232 | 6.61 | 75.79 | 930.6 | 946.5 | 130.3 | 114.9 | 150.5 | 943.0 | 9.06 | 19.23 | 4.0 |


In [None]:
df.dtypes

Melt temperature                                    float64
Mold temperature                                    float64
time_to_fill                                        float64
ZDx - Plasticizing time                             float64
ZUx - Cycle time                                    float64
SKx - Closing force                                 float64
SKs - Clamping force peak value                     float64
Ms - Torque peak value current cycle                float64
Mm - Torque mean value current cycle                float64
APSs - Specific back pressure peak value            float64
APVs - Specific injection pressure peak value       float64
CPn - Screw position at the end of hold pressure    float64
SVo - Shot volume                                   float64
quality                                             float64
dtype: object

**Step 3: Clean the dataset**
- by handling missing values 
- by handling duplicate values
- by changing the datatype

In [None]:
# Step 3: Clean the dataset

# Find NaN
df.isna().any()


In [None]:
# Find duplicates
df.duplicated().any()

In [None]:
# Change the data type to 32-bit floats ("Float32" with a capital F is pandas' nullable float dtype)
df = df.astype("Float32")

df.dtypes
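
The spelling of the dtype string matters here. As a small sketch on a made-up Series: lowercase `"float32"` selects the NumPy dtype, while `"Float32"` selects pandas' nullable extension dtype. Both store 32-bit floats and both report the gap as missing.

```python
import pandas as pd

measurements = pd.Series([1.5, None])

numpy_float = measurements.astype("float32")     # NumPy dtype
nullable_float = measurements.astype("Float32")  # pandas nullable dtype

print(numpy_float.dtype, nullable_float.dtype)
print(numpy_float.isna().tolist(), nullable_float.isna().tolist())
```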

[--> Back to Outline](#course-outline)

---