# Change column names

## Import pandas

- To load the pandas package and start working with it, import the package. 


- The community-agreed alias for pandas is `pd`, so importing pandas as `pd` is standard practice throughout the pandas documentation:

In [1]:
import pandas as pd

## Import data

In [2]:
# URL of data
URL = "https://raw.githubusercontent.com/kirenz/datasets/master/height_unclean.csv"
df = pd.read_csv(URL, sep=";", decimal=',')
df

Unnamed: 0,Name,ID%,Height,Average Height Parents,Gender
0,Stefanie,1,162,161.5,female
1,Peter,2,163,163.5,male
2,Stefanie,3,163,163.2,female
3,Manuela,4,164,165.1,female
4,Simon,5,164,163.2,male
5,Sophia,6,164,164.4,female
6,Ellen,7,164,164.0,female
7,Emilia,8,165,165.2,female
8,Lina,9,165,165.2,female
9,Marie,10,165,165.1,female


## Change column names

Usually, we prefer to work with columns that have the following properties:


- no leading or trailing whitespace (`"name"` instead of `" name "`, `" name"` or `"name "`)


- all lowercase (`"name"` instead of `"Name"`)


- no white spaces (`"my_name"` instead of `"my name"`)
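
The three properties above can be checked programmatically. The following is a minimal sketch, not part of the original exercise: the helper `violations` and the toy column names are hypothetical and only illustrate the conventions.

```python
import pandas as pd


def violations(name):
    """Return the naming conventions (from the list above) that a column name breaks."""
    problems = []
    if name != name.strip():
        problems.append("leading/trailing whitespace")
    if name != name.lower():
        problems.append("not lowercase")
    if " " in name.strip():
        problems.append("inner white space")
    return problems


# Toy frame with made-up column names (assumption, not the course data)
toy = pd.DataFrame(columns=["name", " Name", "my name"])

# Map every column name to the list of conventions it breaks
report = {col: violations(col) for col in toy.columns}
print(report)
```

An empty list means the column name already follows all three conventions.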

### Simple rename

- First, we rename a column by passing a mapping from the old name to the new name.
- We rename `"Name"` to `"name"` and save the result in `df`; with `errors="raise"`, pandas raises an error if the old name does not exist.


Hint:

```python
df = df.rename(columns={"OLD_NAME": "NEW_NAME"}, errors="raise")
```

In [3]:
### BEGIN SOLUTION
df = df.rename(columns={"Name": "name"}, errors="raise")
### END SOLUTION

In [4]:
"""Check if your code returns the correct output"""
assert df.loc[0, 'name'] == "Stefanie"

In [5]:
df.head(2)

Unnamed: 0,name,ID%,Height,Average Height Parents,Gender
0,Stefanie,1,162,161.5,female
1,Peter,2,163,163.5,male


- Let's rename `Gender` to `gender`.


- Here, we just want to display the result (without saving it).


- Remove the # and run the following code:

In [None]:
# df.rename(columns={"Gender": "gender"}, errors="raise")

- This raises an error. 


- Can you spot the problem? Take a look at the end of the error statement and describe the type of error. How could you fix the problem?

- The KeyError statement tells us that `"['Gender'] not found in axis"`


- This is because the column name has a white space at the beginning: `" Gender"`


- We could fix this problem by typing `" Gender"` instead of `"Gender"`


- However, there are useful tools (regular expressions) to deal with this kind of problem
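
The behaviour described above can be reproduced on a small toy frame. This is a sketch under assumptions: the frame `toy` and its values are made up, and only the leading space in `" Gender"` mirrors the real data.

```python
import pandas as pd

# Toy frame; the leading space in " Gender" mimics the problem described above
toy = pd.DataFrame({"Name": ["Stefanie"], " Gender": ["female"]})

# Printing the names as a list shows the quotes, so the extra space becomes visible
print(toy.columns.tolist())

# errors="raise" turns a missing label into a KeyError ...
try:
    toy.rename(columns={"Gender": "gender"}, errors="raise")
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# ... whereas the default errors="ignore" silently returns the names unchanged
unchanged = toy.rename(columns={"Gender": "gender"})
print(unchanged.columns.tolist())
```

Using `errors="raise"` is the safer choice in a cleaning script, because a misspelled column name is reported instead of being skipped without notice.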

### Trailing and leading spaces (with regex)

- We use regular expressions to deal with whitespace


- To change multiple column names in `df` at once, we apply a string method to all names and assign the result back: `df.columns = df.columns.str___` 

- To replace the spaces, we use `.replace()` with `regex=True`

- In the following pattern, we search for leading spaces (start of the name followed by spaces) and trailing spaces (spaces followed by the end of the name) and replace them with an empty string:

Hint:

replace r"*this pattern*" with empty string r""


```python
df.columns = df.columns.str.replace(r"___|___", r"", regex=True)
```

Explanation for *regex* (see also [Stack Overflow](https://stackoverflow.com/a/67466222)):

- we start with `r` (for raw), which makes the pattern a raw string, so Python does not interpret backslashes as escape sequences
- "`^`": matches the start of the string
- " ": is a white space
- "`+`": one or more repetitions of the preceding character (here: the white space)
- "`|`": is or
- "`$`": matches the end of the string
- "": is an empty string (the replacement)


To learn more about regular expressions ("regex"), visit the following sites:

- [regular expression basics](https://www.w3schools.com/python/python_regex.asp).
- [interactive regular expressions tool](https://regex101.com/)
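
The effect of the `+` quantifier can be tried on a plain string with Python's built-in `re` module. This is a small sketch; the string `name` is a made-up example with two leading and two trailing spaces.

```python
import re

# Made-up example: two leading and two trailing spaces
name = "  Gender  "

# " $" matches exactly one space directly before the end of the string
one_trailing = re.sub(r"^ +| $", "", name)

# " +$" matches one or more spaces directly before the end of the string
all_trailing = re.sub(r"^ +| +$", "", name)

print(repr(one_trailing))  # 'Gender '
print(repr(all_trailing))  # 'Gender'
```

Without the `+` after the second space, only the last trailing space is removed, so names with several trailing spaces would stay unclean.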

In [6]:
### BEGIN SOLUTION
df.columns = df.columns.str.replace(r"^ +| +$", r"", regex=True)
### END SOLUTION

In [7]:
"""Check if your code returns the correct output"""
assert df.columns.tolist() == ['name', 'ID%', 'Height', 'Average Height Parents', 'Gender']

In [8]:
df.columns

Index(['name', 'ID%', 'Height', 'Average Height Parents', 'Gender'], dtype='object')
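
As an alternative to the regular expression, pandas also provides `.str.strip()`, which removes leading and trailing whitespace (including tabs and newlines). This is a different technique from the one used in the exercise; the labels in `cols` below are made up for illustration.

```python
import pandas as pd

# Made-up labels with leading and/or trailing spaces
cols = pd.Index(["  name", "ID% ", " Gender  "])

# .str.strip() removes whitespace on both sides of every label
stripped = cols.str.strip()
print(stripped.tolist())  # ['name', 'ID%', 'Gender']
```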

### Replace special characters

- Again, we use regular expressions to deal with special characters (like %, &, $ etc.)

Replace "%" with an empty string

Hint:
    
```python
df.___ = df.columns.str.___(r"___", r"", regex=True)
```

In [9]:
### BEGIN SOLUTION
df.columns = df.columns.str.replace(r"%", r"", regex=True)
### END SOLUTION

In [10]:
"""Check if your code returns the correct output"""
assert df.columns.tolist() == ['name', 'ID', 'Height', 'Average Height Parents', 'Gender']

In [12]:
df.columns

Index(['name', 'ID', 'Height', 'Average Height Parents', 'Gender'], dtype='object')
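
To remove several special characters in one step, a character class such as `[%&$]` can be used. This is a sketch with made-up labels, not part of the exercise. Note that `$` is literal inside the square brackets, whereas outside a class it would have to be escaped as `\$`.

```python
import pandas as pd

# Made-up labels containing different special characters
cols = pd.Index(["ID%", "Price$", "R&D", "Height"])

# The character class [%&$] matches any single one of the three characters
cleaned = cols.str.replace(r"[%&$]", "", regex=True)
print(cleaned.tolist())  # ['ID', 'Price', 'RD', 'Height']
```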

### Lowercase and whitespace

We can use two simple methods to convert all columns to lowercase and replace white spaces with underscores ("_"):

- `.str.lower()`


- `.str.replace(' ', '_')`

Hint:

```python
___.___ = ___.___.___.___().___.___(' ', '_')
```

In [13]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [14]:
"""Check if your code returns the correct output"""
assert df.columns.tolist() == ['name', 'id', 'height', 'average_height_parents', 'gender']

In [15]:
df.columns

Index(['name', 'id', 'height', 'average_height_parents', 'gender'], dtype='object')
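
The individual steps from this section can also be chained into one reusable function. The following is a minimal sketch: the helper name `clean_columns` is hypothetical, and the toy frame only assumes column names like the ones in the data set, including a leading space in `" Gender"`.

```python
import pandas as pd


def clean_columns(df):
    """Return a copy of df with tidied column names (hypothetical helper)."""
    out = df.copy()
    out.columns = (
        out.columns
        .str.replace(r"^ +| +$", "", regex=True)  # leading and trailing spaces
        .str.replace(r"%", "", regex=True)        # special character
        .str.lower()                              # all lowercase
        .str.replace(" ", "_", regex=False)       # inner spaces to underscores
    )
    return out


# Empty toy frame with assumed column names
toy = pd.DataFrame(
    columns=["Name", "ID%", "Height", "Average Height Parents", " Gender"]
)
result = clean_columns(toy).columns.tolist()
print(result)
```

Working on a copy leaves the original frame untouched, so the cleaning can be rerun or adjusted without reloading the data.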