# Data Transformation with Pandas

- **Data transformation** refers to the process of converting and altering data from one format, structure, or representation to another. It involves various operations that can be applied to raw or source data to prepare it for analysis, modeling, or visualization. 

- Data transformation is a crucial step in the data preprocessing pipeline, and it is used for several purposes, including:
    - **Data Integration:** combine data from multiple sources or systems, which often requires aligning data types, units of measurement, or date formats.
    - **Feature Engineering:** create new features or variables from existing data, such as mathematical transformations, aggregations, or derived categorical variables.
    - **Data Aggregation:** aggregate data at different levels, such as summing, averaging, or counting values within specified groups.
    - **Normalization and Scaling:** bring data of different scales into a common range, making it suitable for algorithms sensitive to the scale of variables, such as many machine learning models.
    - **Encoding Categorical Data:** transform textual or labeled data into numerical values, such as one-hot encoding, label encoding, or ordinal encoding.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../data/billionairs.csv")
df.shape

(2640, 35)

## 1. Dropping, renaming, counting values

### 1.1. Drop

Drop an entire column: `drop()` removes rows or columns by label. Either pass the labels together with an `axis`, or use the `columns=` and `index=` keywords directly.

In [3]:
# Add an extra column
df["extra"] = np.nan
"extra" in df.columns

True

In [4]:
# Drop column "extra"
df.drop(columns="extra", inplace=True) # KeyError raised if there is no column "extra"
"extra" in df.columns

False

Drop multiple columns

In [5]:
df["extra1"] = np.nan
df["extra2"] = np.nan
print("Before:", "extra1" in df.columns, "extra2" in df.columns)

df.drop(columns=["extra1", "extra2"], inplace=True)
print("After:", "extra1" in df.columns, "extra2" in df.columns)

Before: True True
After: False False


Drop rows by index

In [6]:
df.tail(3)

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
2637,2540,1000,Manufacturing,Zhang Gongyun,60.0,China,Gaomi,Tyre manufacturing machinery,Manufacturing,China,...,2.9,"$19,910,000,000,000",50.6,100.2,77.0,9.4,59.2,1397715000.0,35.86166,104.195397
2638,2540,1000,Real Estate,Zhang Guiping & family,71.0,China,Nanjing,Real estate,Real Estate,China,...,2.9,"$19,910,000,000,000",50.6,100.2,77.0,9.4,59.2,1397715000.0,35.86166,104.195397
2639,2540,1000,Diversified,Inigo Zobel,66.0,Philippines,Makati,Diversified,Diversified,Philippines,...,2.5,"$376,795,508,680",35.5,107.5,71.1,14.0,43.1,108116600.0,12.879721,121.774017


In [7]:
# Append a copy of the last row under a new index label
last_index = df.index[-1]
df.loc[last_index + 1] = df.iloc[-1]  # iloc[-1] is the last row by position, whatever its label
df.tail(3)

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
2638,2540,1000,Real Estate,Zhang Guiping & family,71.0,China,Nanjing,Real estate,Real Estate,China,...,2.9,"$19,910,000,000,000",50.6,100.2,77.0,9.4,59.2,1397715000.0,35.86166,104.195397
2639,2540,1000,Diversified,Inigo Zobel,66.0,Philippines,Makati,Diversified,Diversified,Philippines,...,2.5,"$376,795,508,680",35.5,107.5,71.1,14.0,43.1,108116600.0,12.879721,121.774017
2640,2540,1000,Diversified,Inigo Zobel,66.0,Philippines,Makati,Diversified,Diversified,Philippines,...,2.5,"$376,795,508,680",35.5,107.5,71.1,14.0,43.1,108116600.0,12.879721,121.774017


In [8]:
# Drop last row
df.drop(index=last_index+1, inplace=True)
df.tail(3)

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
2637,2540,1000,Manufacturing,Zhang Gongyun,60.0,China,Gaomi,Tyre manufacturing machinery,Manufacturing,China,...,2.9,"$19,910,000,000,000",50.6,100.2,77.0,9.4,59.2,1397715000.0,35.86166,104.195397
2638,2540,1000,Real Estate,Zhang Guiping & family,71.0,China,Nanjing,Real estate,Real Estate,China,...,2.9,"$19,910,000,000,000",50.6,100.2,77.0,9.4,59.2,1397715000.0,35.86166,104.195397
2639,2540,1000,Diversified,Inigo Zobel,66.0,Philippines,Makati,Diversified,Diversified,Philippines,...,2.5,"$376,795,508,680",35.5,107.5,71.1,14.0,43.1,108116600.0,12.879721,121.774017


### 1.2. Rename

Rename a single column

In [9]:
# add new column
df["old_col"] = np.nan
df.columns

Index(['rank', 'finalWorth', 'category', 'personName', 'age', 'country',
       'city', 'source', 'industries', 'countryOfCitizenship', 'organization',
       'selfMade', 'status', 'gender', 'birthDate', 'lastName', 'firstName',
       'title', 'date', 'state', 'residenceStateRegion', 'birthYear',
       'birthMonth', 'birthDay', 'cpi_country', 'cpi_change_country',
       'gdp_country', 'gross_tertiary_education_enrollment',
       'gross_primary_education_enrollment_country', 'life_expectancy_country',
       'tax_revenue_country_country', 'total_tax_rate_country',
       'population_country', 'latitude_country', 'longitude_country',
       'old_col'],
      dtype='object')

In [11]:
df.rename(columns={"old_col": "new_col"}, inplace=True)
df.columns

Index(['rank', 'finalWorth', 'category', 'personName', 'age', 'country',
       'city', 'source', 'industries', 'countryOfCitizenship', 'organization',
       'selfMade', 'status', 'gender', 'birthDate', 'lastName', 'firstName',
       'title', 'date', 'state', 'residenceStateRegion', 'birthYear',
       'birthMonth', 'birthDay', 'cpi_country', 'cpi_change_country',
       'gdp_country', 'gross_tertiary_education_enrollment',
       'gross_primary_education_enrollment_country', 'life_expectancy_country',
       'tax_revenue_country_country', 'total_tax_rate_country',
       'population_country', 'latitude_country', 'longitude_country',
       'new_col'],
      dtype='object')

Rename multiple columns

In [14]:
df["new_col_2"] = np.nan
df.rename(columns={
    "new_col": "newCol",
    "new_col_2": "newCol2"
}, inplace=True)
df.columns

Index(['rank', 'finalWorth', 'category', 'personName', 'age', 'country',
       'city', 'source', 'industries', 'countryOfCitizenship', 'organization',
       'selfMade', 'status', 'gender', 'birthDate', 'lastName', 'firstName',
       'title', 'date', 'state', 'residenceStateRegion', 'birthYear',
       'birthMonth', 'birthDay', 'cpi_country', 'cpi_change_country',
       'gdp_country', 'gross_tertiary_education_enrollment',
       'gross_primary_education_enrollment_country', 'life_expectancy_country',
       'tax_revenue_country_country', 'total_tax_rate_country',
       'population_country', 'latitude_country', 'longitude_country', 'newCol',
       'newCol2'],
      dtype='object')

In [16]:
df.drop(columns=["newCol", "newCol2"], inplace=True)
df.columns

Index(['rank', 'finalWorth', 'category', 'personName', 'age', 'country',
       'city', 'source', 'industries', 'countryOfCitizenship', 'organization',
       'selfMade', 'status', 'gender', 'birthDate', 'lastName', 'firstName',
       'title', 'date', 'state', 'residenceStateRegion', 'birthYear',
       'birthMonth', 'birthDay', 'cpi_country', 'cpi_change_country',
       'gdp_country', 'gross_tertiary_education_enrollment',
       'gross_primary_education_enrollment_country', 'life_expectancy_country',
       'tax_revenue_country_country', 'total_tax_rate_country',
       'population_country', 'latitude_country', 'longitude_country'],
      dtype='object')

Rename one or more index labels (without `inplace=True`, `rename()` returns a new DataFrame and leaves `df` unchanged)

In [17]:
df.index[:5]

Index([0, 1, 2, 3, 4], dtype='int64')

In [19]:
renamed_index = df.rename(index={0: "zero", 1: "one"})
renamed_index.index[:5]

Index(['zero', 'one', 2, 3, 4], dtype='object')

In [20]:
renamed_index.head(3)

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
zero,1,211000,Fashion & Retail,Bernard Arnault & family,74.0,France,Paris,LVMH,Fashion & Retail,France,...,1.1,"$2,715,518,274,227",65.6,102.5,82.5,24.2,60.7,67059887.0,46.227638,2.213749
one,2,180000,Automotive,Elon Musk,51.0,United States,Austin,"Tesla, SpaceX",Automotive,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
2,3,114000,Technology,Jeff Bezos,59.0,United States,Medina,Amazon,Technology,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891


### 1.3. Count values
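
A minimal sketch of counting values with `value_counts()` and `nunique()`. The `toy` frame below is a hypothetical stand-in with made-up rows; only its column names mirror the dataset loaded above.

```python
import pandas as pd

# Toy data (hypothetical rows); column names mirror the billionaires dataset
toy = pd.DataFrame({
    "country": ["France", "United States", "United States", "China", None],
    "category": ["Fashion & Retail", "Automotive", "Technology",
                 "Manufacturing", "Technology"],
})

# Frequency of each distinct value; missing values are excluded by default
country_counts = toy["country"].value_counts()

# Keep missing values as their own entry
country_counts_with_na = toy["country"].value_counts(dropna=False)

# Relative frequencies (fractions that sum to 1) instead of raw counts
category_share = toy["category"].value_counts(normalize=True)

# Number of distinct non-missing values
n_unique = toy["country"].nunique()
```

On the real data the same calls would apply to `df["country"]` or `df["category"]`.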

## 2. Sorting

- `sort_values()`: Sorts DataFrame or Series by specified columns.
- `sort_index()`: Sorts the DataFrame or Series by its index.
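
The two methods above can be sketched on a small frame. The rows are hypothetical and the index is shuffled on purpose so that `sort_index()` has a visible effect.

```python
import pandas as pd

# Toy data (hypothetical rows) with a deliberately unordered index
toy = pd.DataFrame(
    {
        "personName": ["A", "B", "C", "D"],
        "country": ["France", "China", "China", "France"],
        "finalWorth": [300, 100, 200, 400],
    },
    index=[3, 1, 0, 2],
)

# Sort by one column, largest value first
by_worth = toy.sort_values("finalWorth", ascending=False)

# Sort by several columns, each with its own direction
by_country_then_worth = toy.sort_values(
    ["country", "finalWorth"], ascending=[True, False]
)

# Restore the order given by the index labels
by_index = toy.sort_index()
```

Both methods return a new object by default, so `toy` itself keeps its original order.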

## 3. Grouping

- `groupby()`: Groups data based on one or more columns for aggregation.
- Aggregation functions like `sum()`, `mean()`, `count()`, and custom aggregation using `agg()`.
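
A short sketch of `groupby()` with a built-in aggregation and with `agg()`. The data is hypothetical, and the output column names (`total_worth`, `mean_age`, `n_ages`, `worth_range`) are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Toy data (hypothetical rows); one age is missing on purpose
toy = pd.DataFrame({
    "country": ["France", "China", "China", "France", "China"],
    "finalWorth": [300, 100, 200, 400, 600],
    "age": [74.0, 60.0, 71.0, 50.0, None],
})

# One aggregation on one column
total_by_country = toy.groupby("country")["finalWorth"].sum()

# Several aggregations at once using named aggregation:
# output_name=(input_column, function)
summary = toy.groupby("country").agg(
    total_worth=("finalWorth", "sum"),
    mean_age=("age", "mean"),    # missing ages are skipped
    n_ages=("age", "count"),     # counts non-missing values only
    worth_range=("finalWorth", lambda s: s.max() - s.min()),  # custom function
)
```

Named aggregation keeps the result columns flat and self-describing, which is usually easier to work with than the multi-level columns produced by passing a dict of lists to `agg()`.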

## 4. Merging data from different tables

- `concat()`: Concatenates DataFrames vertically or horizontally.
- `merge()`: Performs SQL-style joins based on specified columns.
- `join()`: Joins DataFrames using their index or a specified column.
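
The three functions above can be compared on two small tables. Both tables are hypothetical (the `population` figures are placeholders, not real statistics), and "Peru" is left without a match to show how the join type changes the result.

```python
import pandas as pd

# Toy tables (hypothetical values)
people = pd.DataFrame({
    "personName": ["A", "B", "C"],
    "country": ["France", "China", "Peru"],
})
countries = pd.DataFrame({
    "country": ["France", "China"],
    "population": [67, 1398],  # placeholder numbers
})

# concat: stack rows (axis=0, the default) or place columns side by side (axis=1)
stacked = pd.concat([people, people], ignore_index=True)
side_by_side = pd.concat([people, people.add_suffix("_copy")], axis=1)

# merge: SQL-style join on a shared column
left = people.merge(countries, on="country", how="left")    # keeps unmatched "Peru"
inner = people.merge(countries, on="country", how="inner")  # drops unmatched "Peru"

# join: match a column of the left frame against the index of the right frame
joined = people.join(countries.set_index("country"), on="country")
```

Here `joined` carries the same information as `left`; `join()` is mostly a convenience when the lookup table is already indexed by the key.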

## 5. Scaling and normalization
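
A minimal sketch of two common rescaling schemes written in plain pandas: min-max scaling to the range [0, 1] and z-score standardization. These are one possible choice among several, and the values below are hypothetical.

```python
import pandas as pd

# Toy values (hypothetical); the name mirrors the finalWorth column
worth = pd.Series([1000.0, 2000.0, 4000.0, 9000.0], name="finalWorth")

# Min-max scaling: smallest value maps to 0, largest to 1
min_max = (worth - worth.min()) / (worth.max() - worth.min())

# Z-score standardization: mean 0 and unit (population) standard deviation
z_score = (worth - worth.mean()) / worth.std(ddof=0)
```

Note that `Series.std()` uses `ddof=1` (sample standard deviation) by default; `ddof=0` is passed here so that the standardized values have a population standard deviation of exactly 1.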

## 6. One-hot encoding
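
One way to one-hot encode a categorical column in pandas is `pd.get_dummies()`. The sketch below uses hypothetical rows; `dtype=int` is passed so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "category": ["Technology", "Automotive", "Technology"],
    "finalWorth": [114000, 180000, 107000],
})

# Replace "category" with one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["category"], dtype=int)
```

Each new column is named `<original column>_<value>`, and the original `category` column is dropped from the result.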

**EXERCISE**