# Change data types and add columns

## Import pandas

In [36]:
import pandas as pd

## Import data

In [37]:
# URL of data
URL = "https://raw.githubusercontent.com/kirenz/datasets/master/height_clean.csv"

In [38]:
df = pd.read_csv(URL)

In [39]:
df.dtypes

name                       object
id                          int64
height                      int64
average_height_parents    float64
gender                     object
dtype: object

## Change data type

- There are several ways to [change data types in pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html).

The most common method is `.astype()`:

- `.astype()`: Convert to a specific type (like "int32", "float" or "category")
- `.astype(str)`: Convert to string  
  
More options:  
  
- `to_datetime`: Convert argument to datetime.
- `to_timedelta`: Convert argument to timedelta.
- `to_numeric`: Convert argument to a numeric type.
- `numpy.ndarray.astype`: Cast a numpy array to a specified type.
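
As a small illustration of the conversion functions listed above, here is a minimal sketch. The DataFrame `raw` and its column names are made up for this example and are not part of the height data.

```python
import pandas as pd

# hypothetical example data, stored entirely as strings
raw = pd.DataFrame({
    "amount": ["1", "2", "n/a"],
    "day": ["2022-10-01", "2022-10-02", "2022-10-03"],
    "duration": ["1 days", "2 days", "3 days"],
})

# errors="coerce" turns values that cannot be parsed into NaN
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# parse the strings as dates using an explicit format
raw["day"] = pd.to_datetime(raw["day"], format="%Y-%m-%d")

# parse the strings as time spans
raw["duration"] = pd.to_timedelta(raw["duration"])

print(raw.dtypes)
```

Unlike `.astype()`, these functions can handle values that do not parse cleanly, which is why they are often preferred for messy input.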

### Categorical data

- Categoricals are a pandas data type corresponding to categorical variables in statistics. 


- A categorical variable takes on a limited, and usually fixed, number of possible values (categories). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.


- Convert variable "gender" to a category dtype:

Hint:
    
```python
df["___"] = df["___"].astype("___")
```


In [40]:
### BEGIN SOLUTION
df["gender"] = df["gender"].astype("category")
### END SOLUTION

In [41]:
"""Check if your code returns the correct output"""
assert df["gender"].dtype == "category"
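
To see what the category dtype stores internally, consider this minimal sketch. The Series `gender` is a hypothetical stand-in for the real column.

```python
import pandas as pd

# hypothetical stand-in for the gender column
gender = pd.Series(["female", "male", "female", "female", "male"])
gender_cat = gender.astype("category")

# the distinct categories are stored only once ...
print(list(gender_cat.cat.categories))

# ... and each row holds a small integer code pointing to its category
print(gender_cat.cat.codes.tolist())
```

Because every row only holds a code, a categorical column with few distinct values usually needs less memory than the same column stored as strings.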

### String data

- In our example, `id` is stored as an integer, but it is not a quantity (calculations such as a mean of `id` are meaningless)
- It is just a unique identifier, so we should convert it to a simple string (object)

Hint:

```python
df['___'] = df['___'].___(___)
```


In [42]:
df['id'] = df['id'].astype(str)

df.dtypes

name                        object
id                          object
height                       int64
average_height_parents     float64
gender                    category
dtype: object
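
One practical effect of this conversion can be sketched as follows. The DataFrame `demo` is a hypothetical miniature version of the data.

```python
import pandas as pd

# hypothetical miniature version of the data
demo = pd.DataFrame({"id": [1, 2, 3], "height": [162, 163, 163]})

# as an integer, id is summarized like a measurement
cols_before = demo.describe().columns.tolist()
print(cols_before)

demo["id"] = demo["id"].astype(str)

# as a string, id no longer appears in the numeric summary
cols_after = demo.describe().columns.tolist()
print(cols_after)
```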

## Add new columns

### Constant

- Add a new variable called "number" to df 
- The new variable should have the number 42 in all rows


Hint:
  
```python
df["___"] = ___
```

In [43]:
### BEGIN SOLUTION
df["number"] = 42
### END SOLUTION

In [44]:
df.head()

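|   | name | id | height | average_height_parents | gender | number |
|---|---|---|---|---|---|---|
| 0 | Stefanie | 1 | 162 | 161.5 | female | 42 |
| 1 | Peter | 2 | 163 | 163.5 | male | 42 |
| 2 | Stefanie | 3 | 163 | 163.2 | female | 42 |
| 3 | Manuela | 4 | 164 | 165.1 | female | 42 |
| 4 | Simon | 5 | 164 | 163.2 | male | 42 |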


In [45]:
"""Check if your code returns the correct output"""
assert df.iloc[0, 5] == 42

### From existing

- Create new columns by calculating with existing columns

In [47]:
# we use numpy to draw values from a normal distribution
import numpy as np

# calculate height in m (from cm)
df['height_m'] = df['height'] / 100

# simulate weight: draw one value per row from a normal distribution
# (mean 45, standard deviation 5) and multiply it by the height in m
df['weight'] = (np.random.normal(45, 5, len(df)) * df['height_m']).round(2)

# calculate body mass index (weight in kg / squared height in m)
df['bmi'] = (df['weight'] / df['height_m'] ** 2).round(2)

In [48]:
df.head()

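|   | name | id | height | average_height_parents | gender | number | height_m | weight | bmi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Stefanie | 1 | 162 | 161.5 | female | 42 | 1.62 | 84.58 | 32.23 |
| 1 | Peter | 2 | 163 | 163.5 | male | 42 | 1.63 | 70.57 | 26.56 |
| 2 | Stefanie | 3 | 163 | 163.2 | female | 42 | 1.63 | 75.48 | 28.41 |
| 3 | Manuela | 4 | 164 | 165.1 | female | 42 | 1.64 | 75.46 | 28.06 |
| 4 | Simon | 5 | 164 | 163.2 | male | 42 | 1.64 | 83.24 | 30.95 |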
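
The simulated weights differ on every run because the random numbers are not seeded. A possible alternative, sketched here on a hypothetical miniature DataFrame `demo`, combines a seeded NumPy generator with `DataFrame.assign()`, which creates several derived columns in one step:

```python
import numpy as np
import pandas as pd

# hypothetical miniature version of the data
demo = pd.DataFrame({"height": [162, 163, 164]})

# a seeded generator returns the same "random" numbers on every run
rng = np.random.default_rng(seed=42)

# later keyword arguments can refer to columns created by earlier ones
demo = demo.assign(
    height_m=lambda d: d["height"] / 100,
    weight=lambda d: (rng.normal(45, 5, len(d)) * d["height_m"]).round(2),
    bmi=lambda d: (d["weight"] / d["height_m"] ** 2).round(2),
)

print(demo)
```

`assign()` returns a new DataFrame instead of modifying the original in place, so the result has to be stored in a variable.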


### Date

- To add a date, we can use datetime and [strftime](https://strftime.org), which formats the date as a string:

In [49]:
# we need datetime to add a date
from datetime import datetime

df["date"] = datetime.today().strftime('%Y-%m-%d')

df.head(3)

|   | name | id | height | average_height_parents | gender | number | height_m | weight | bmi | date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Stefanie | 1 | 162 | 161.5 | female | 42 | 1.62 | 84.58 | 32.23 | 2022-10-08 |
| 1 | Peter | 2 | 163 | 163.5 | male | 42 | 1.63 | 70.57 | 26.56 | 2022-10-08 |
| 2 | Stefanie | 3 | 163 | 163.2 | female | 42 | 1.63 | 75.48 | 28.41 | 2022-10-08 |
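
Since `strftime` returns text, the `date` column above holds strings rather than real dates. If date calculations are needed later, one option is to store a pandas timestamp instead. This is a minimal sketch on a hypothetical miniature DataFrame `demo`:

```python
import pandas as pd

# hypothetical miniature version of the data
demo = pd.DataFrame({"name": ["Stefanie", "Peter"]})

# today's date at midnight, stored as a real datetime value
demo["date"] = pd.Timestamp.today().normalize()

# the .dt accessor only works on datetime columns, not on strings
demo["year"] = demo["date"].dt.year

print(demo.dtypes)
```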
