# Data Processing and Transformation


Data processing and transformation tasks are the foundation of many data engineering activities. They include operations such as:

<ul>
    <li><b>Cleaning:</b> Correcting or removing any inaccuracies or errors in the data.</li>
    <li><b>Filtering:</b> Selecting a subset of the data based on certain conditions.</li>
    <li><b>Aggregation:</b> Combining multiple data points into a single data point (e.g., summing, averaging).</li>
</ul>


## Tools and libraries for data processing and transformation in Python


Python offers several libraries that are widely used for data processing and transformation tasks:

<ol>
    <li><b>Pandas</b> is a popular library for data manipulation and analysis. It provides data structures and functions needed to manipulate structured data.</li>
    <li><b>NumPy</b> is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.</li>
    <li><b>Dask</b> is a flexible library for parallel computing in Python that makes it easy to build intuitive workflows for ingesting, filtering, and manipulating data.</li>
</ol>


<hr style="background: linear-gradient(to right, #f00, #00f); height: 5px; border: none;" />


### Using `Pandas` for 

#### 1. Data Cleaning

Let's consider a simple example where we clean a dataset using Pandas.

<pre><code class="language-python">
<font color="blue">import</font> pandas <font color="blue">as</font> pd

# Load the dataset
df = pd.read_csv(<font color="green">'data.csv'</font>)

# Replace NA/NaN values with 0
df.fillna(<font color="magenta">0</font>, inplace=<font color="blue">True</font>)

# Remove duplicates
df.drop_duplicates(inplace=<font color="blue">True</font>)
</code></pre>


In this code snippet, we first load the dataset using the `pd.read_csv()` function. Then, we replace all NA/NaN values with 0 using the `df.fillna()` method. Finally, we remove duplicate rows using the `df.drop_duplicates()` method.
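
To make the effect of these two steps visible without a `data.csv` file, here is a minimal sketch on a small in-memory DataFrame. The column names and values are invented for illustration, and the steps are chained instead of using `inplace=True`, which is a stylistic choice and not a requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory data standing in for data.csv
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Bob", "Cy"],
        "score": [90.0, np.nan, np.nan, 75.0],
    }
)

# Replace missing values with 0, then drop rows that are exact duplicates
cleaned = df.fillna(0).drop_duplicates().reset_index(drop=True)

print(cleaned)
```

Chaining returns a new DataFrame and leaves `df` untouched, so the raw data stays available for comparison.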


A second example handles placeholder strings, missing values, and outliers:

<pre><code class="language-python">
<font color="blue">import</font> numpy <font color="blue">as</font> np
<font color="blue">import</font> pandas <font color="blue">as</font> pd

# Load the dataset
df = pd.read_csv(<font color="green">'data.csv'</font>)

# Replace "unknown" placeholders with NaN
df.replace(<font color="green">"unknown"</font>, np.nan, inplace=<font color="blue">True</font>)

# Convert 'age' and 'income' to numeric; values that cannot be parsed become NaN
df[<font color="green">'age'</font>] = pd.to_numeric(df[<font color="green">'age'</font>], errors=<font color="green">'coerce'</font>)
df[<font color="green">'income'</font>] = pd.to_numeric(df[<font color="green">'income'</font>], errors=<font color="green">'coerce'</font>)

# Drop rows with NaN in the 'age' or 'income' column
df.dropna(subset=[<font color="green">'age'</font>, <font color="green">'income'</font>], inplace=<font color="blue">True</font>)

# Replace outlier ages above 100 with the median age
median_age = df[<font color="green">'age'</font>].median()
df.loc[df[<font color="green">'age'</font>] > <font color="magenta">100</font>, <font color="green">'age'</font>] = median_age
</code></pre>

In this snippet, we first replace "unknown" values with NaN, Pandas' representation for missing data. We then convert the age and income columns to numeric types, so that any value that cannot be parsed also becomes NaN, and drop rows where either column is missing. The conversion comes before the drop and the median calculation because columns that contained placeholder strings are loaded as text, and numeric comparisons on text would fail. Finally, we replace ages over 100 (assumed to be outlier data) with the median age.
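
The cleaning steps described above can be tried on concrete values with the following sketch. The sample rows are invented, and the string `"n/a"` is just one hypothetical example of an unparseable entry.

```python
import pandas as pd

# Hypothetical raw values as they might arrive from a CSV file (all strings)
raw = pd.DataFrame(
    {
        "age": ["34", "unknown", "130", "52", "41"],
        "income": ["52000", "61000", "48000", "n/a", "75000"],
    }
)

# Coerce both columns to numbers; unparseable entries become NaN
for col in ["age", "income"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

# Keep only rows where both values are present
clean = raw.dropna(subset=["age", "income"]).copy()

# Replace implausible ages (> 100) with the median of the remaining ages
median_age = clean["age"].median()
clean.loc[clean["age"] > 100, "age"] = median_age

print(clean)
```

Note that the median here is computed while the outlier is still in the column. With many rows this rarely matters, but on small samples it may be worth computing the median from the plausible ages only.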


#### 2. Filtering

<pre><code class="language-python">
<font color="blue">import</font> pandas <font color="blue">as</font> pd

# Load the dataset
df = pd.read_csv(<font color="green">'data.csv'</font>)

# Filtering the data where 'age' is greater than 30
df_filtered = df[df[<font color="green">'age'</font>] > <font color="magenta">30</font>]

# Filtering data where 'income' is greater than 50000 and 'age' is less than 50
df_filtered = df[(df[<font color="green">'income'</font>] > <font color="magenta">50000</font>) &amp; (df[<font color="green">'age'</font>] &lt; <font color="magenta">50</font>)]
</code></pre>
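
As a complement to the boolean-mask filters above, the sketch below shows the same combined condition written with `DataFrame.query`, plus a membership filter with `Series.isin`. The sample data is made up; only the column names follow the snippet above.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the snippet above
df = pd.DataFrame(
    {
        "age": [25, 35, 45, 55],
        "income": [40000, 60000, 80000, 90000],
        "region": ["north", "south", "north", "east"],
    }
)

# Boolean-mask form: wrap each condition in parentheses, combine with & or |
mask = (df["income"] > 50000) & (df["age"] < 50)
by_mask = df[mask]

# Equivalent query-string form
by_query = df.query("income > 50000 and age < 50")

# Membership test against a list of allowed values
north_or_east = df[df["region"].isin(["north", "east"])]
```

The parentheses in the mask form are required because `&` binds more tightly than comparison operators in Python.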


#### 3. Aggregation

<pre><code class="language-python">
<font color="blue">import</font> pandas <font color="blue">as</font> pd

# Load the dataset
df = pd.read_csv(<font color="green">'data.csv'</font>)

# Calculate the mean 'income' by 'age'
df_grouped = df.groupby(<font color="green">'age'</font>)[<font color="green">'income'</font>].mean()

# Calculate the total 'income' and average 'spending' by 'region'
df_grouped = df.groupby(<font color="green">'region'</font>).agg({<font color="green">'income'</font>: <font color="green">'sum'</font>, <font color="green">'spending'</font>: <font color="green">'mean'</font>})
</code></pre>

These are the basics of filtering and aggregation with Pandas, both of which are integral to many data processing and transformation tasks. Filtering selects the subset of rows that meets certain criteria, while aggregation computes summary statistics over groups in the data.
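
To show what a grouped aggregation returns, here is a small sketch on invented data. It uses named aggregation, an alternative spelling of the dictionary form that lets you choose the output column names; the names `total_income` and `avg_spending` are illustrative.

```python
import pandas as pd

# Hypothetical sample rows with the columns used in the aggregation example
df = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "income": [50000, 70000, 40000, 60000],
        "spending": [20000, 30000, 10000, 30000],
    }
)

# One output row per region, with explicitly named result columns
summary = (
    df.groupby("region")
    .agg(total_income=("income", "sum"), avg_spending=("spending", "mean"))
    .reset_index()
)

print(summary)
```

`reset_index()` turns the group key back into an ordinary column, which is often more convenient when the result is written to a file or joined with other data.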


### Using `NumPy` for

#### 1. Cleaning

Data cleaning with `NumPy` is possible but less straightforward than with Pandas, especially for real-world data that mixes types or contains missing values. If the data is purely numerical and has no missing values, NumPy can still be used, but Pandas is generally the better option for data cleaning tasks.
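
To illustrate why cleaning in NumPy takes more manual work, the sketch below handles missing values in a float array by hand. The array contents are invented, and filling with the mean is only one possible strategy.

```python
import numpy as np

# Hypothetical sensor readings; NaN marks a missing value
readings = np.array([1.5, np.nan, 3.0, np.nan, 4.5])

# Boolean mask of the missing positions
missing = np.isnan(readings)

# Option 1: drop the missing entries
without_missing = readings[~missing]

# Option 2: fill them with the mean of the values that are present
filled = np.where(missing, np.nanmean(readings), readings)
```

This works only for floating-point arrays, because NaN has no integer or string representation in NumPy. That limitation is one reason Pandas tends to be the more practical choice for mixed real-world data.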


#### 2. Filtering
<pre><code class="language-python">
<font color="blue">import</font> numpy <font color="blue">as</font> np

# Create an array
arr = np.array([<font color="magenta">1</font>, <font color="magenta">2</font>, <font color="magenta">3</font>, <font color="magenta">4</font>, <font color="magenta">5</font>])

# Filtering the array where value is greater than 3
filtered_arr = arr[arr > <font color="magenta">3</font>]
</code></pre>
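
Building on the mask-based filter above, the following sketch combines two conditions and shows two related helpers, `np.flatnonzero` and `np.where`. The thresholds are arbitrary and chosen only for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Combine conditions element-wise with & (and) or | (or)
mask = (arr > 1) & (arr < 5)
in_range = arr[mask]

# Indices of the elements that satisfy the mask
positions = np.flatnonzero(mask)

# Conditional replacement instead of removal: cap values above 3 at 3
capped = np.where(arr > 3, 3, arr)
```

Filtering with a mask returns a new, shorter array, while `np.where` keeps the original shape and only changes the selected values.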


#### 3. Aggregation
<pre><code class="language-python">
<font color="blue">import</font> numpy <font color="blue">as</font> np

# Create an array
arr = np.array([<font color="magenta">1</font>, <font color="magenta">2</font>, <font color="magenta">3</font>, <font color="magenta">4</font>, <font color="magenta">5</font>])

# Calculate the mean of the array
mean_arr = np.mean(arr)

# Calculate the sum of the array
sum_arr = np.sum(arr)
</code></pre>
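
The aggregation functions above also accept an `axis` argument, which matters once the array has more than one dimension. A minimal sketch with an invented 2x3 matrix:

```python
import numpy as np

# Two rows, three columns
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 aggregates down the rows, giving one value per column
col_sums = matrix.sum(axis=0)

# axis=1 aggregates across the columns, giving one value per row
row_means = matrix.mean(axis=1)

# Without an axis, the whole array is reduced to a single number
total = matrix.sum()
```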

The key takeaway is that while NumPy can handle these data processing tasks, a library like Pandas, which is designed for this type of work, is typically easier to use for labeled, tabular data. `NumPy`'s strength lies in numerical computation on arrays and matrices rather than in general data processing and analysis.


