# <center> Data Formatting
    
When working with data, we often need to clean and reformat our datasets. In this section, we will see how to apply basic transformations to columns.

# Native Functions

## Convert Column Types

### To Datetime

We already saw this in the previous section:

```python
pd.to_datetime(df['datetime_column'], format='%Y-%m-%d')
```
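
As a small illustration of the call above, here is a sketch on made-up data. The sample values are hypothetical, and `errors='coerce'` is an optional argument of `pd.to_datetime` that is not used elsewhere in this section; it turns values that do not match the format into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data; the last entry is not a valid date.
dates = pd.Series(['2021-01-15', '2021-02-28', 'not a date'])

# errors='coerce' replaces unparseable values with NaT instead of raising.
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # number of values that could not be parsed
```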

### Apply formatting

We can convert a Series to another type using the method 
```python
.astype(str)
```

In [None]:
import pandas as pd
import numpy as np
import json

In [None]:
df = pd.read_csv('Resources/data.csv')
df.head()

In [None]:
df.dtypes

In [None]:
df.age.astype(str)

In [None]:
df.dtypes

If we want to replace the Series with the converted Series, we need to reassign it:

In [None]:
df.age = df.age.astype(str)
df.dtypes

In [None]:
df.age

**Question: What happens if we want to convert it to integer?**

In [None]:
df.age.astype(int)

In [None]:
int('32.0')

In [None]:
int(float('32.0'))
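
The two cells above suggest why the direct conversion fails: `int()` cannot parse a string that contains a decimal point, so the string has to go through a float first. One possible way to do the same on a whole Series is sketched below. The sample values are hypothetical, and the nullable `'Int64'` dtype is an assumption about what is wanted here: it is one option when the column also contains missing values, which the plain `int` dtype cannot hold.

```python
import pandas as pd

# Hypothetical ages stored as strings, with one missing value.
ages = pd.Series(['32.0', '45.0', None])

# Step 1: parse the strings as floats (the missing value becomes NaN).
ages_float = pd.to_numeric(ages)

# Step 2: cast to the nullable integer dtype 'Int64' (capital I),
# which can represent missing values, unlike the plain int dtype.
ages_int = ages_float.astype('Int64')

print(ages_int)
```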

## Working with null values

Sometimes, we want to get rid of null values in a dataframe. We can either remove the rows with null values, or fill the null values with a specific value.

**Important**: pandas represents missing values as `NaN` (a special float value). This is a different object from Python's `None`.

To test whether a value is missing (either `NaN` or `None`), use the functions `pd.isnull()` or `pd.notnull()`.



In [None]:
np.nan

In [None]:
pd.notnull(np.nan)

In [None]:
pd.isnull(np.nan)

In [None]:
np.nan is None

In [None]:
if np.nan:
    print("Not null")

In [None]:
if None:
    print("Not null")
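
The cells above can be summarised in one short sketch. It only uses `np.nan`, `None` and `pd.isnull()`, and shows why a plain `if value:` or `value == np.nan` test is not a reliable way to detect missing values.

```python
import numpy as np
import pandas as pd

# NaN is a float, so it is "truthy", unlike None.
print(bool(np.nan))       # True
print(bool(None))         # False

# NaN is not equal to anything, including itself.
print(np.nan == np.nan)   # False

# pd.isnull() reports both NaN and None as missing.
print(pd.isnull(np.nan))  # True
print(pd.isnull(None))    # True
```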

### pd.DataFrame.dropna()

In [None]:
df = pd.read_csv('Resources/data.csv')

In [None]:
df.dropna()

In [None]:
df.dropna(how='all')

In [None]:
df.dropna(subset=['name'])

The `inplace=True` argument modifies the dataframe in place instead of returning a new one.

In [None]:
df.dropna(subset=['age'], inplace=True)
df
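
To make the differences between these calls easier to see, here is a sketch on a small hand-made dataframe. The rows are hypothetical; only the column names `name` and `age` come from the examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing age.
people = pd.DataFrame({
    'name': ['Ann', None, 'Carl'],
    'age': [31.0, 25.0, np.nan],
})

# Without inplace=True, dropna() returns a new DataFrame
# and leaves the original untouched.
cleaned = people.dropna(subset=['age'])
print(len(people), len(cleaned))       # 3 2

# Default behaviour: drop every row that has at least one missing value.
print(len(people.dropna()))            # 1

# how='all' only drops rows where every column is missing.
print(len(people.dropna(how='all')))   # 3
```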

###  pd.DataFrame.fillna()

In [None]:
df = pd.read_csv('Resources/data.csv')
df.fillna(0)

We can also replace missing values with the mean of a column, here the mean age.

In [None]:
age_mean = df.age.mean()
df.fillna(age_mean)
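
Note that passing a single value to `fillna()` applies it to every column, not only to `age`. If each column needs its own fill value, one option is to pass a dictionary that maps column names to values. The sketch below uses hypothetical rows, and the `'unknown'` placeholder is only an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing age.
people = pd.DataFrame({
    'name': ['Ann', None, 'Carl'],
    'age': [30.0, 20.0, np.nan],
})

# mean() skips NaN by default, so this is the mean of 30 and 20.
age_mean = people['age'].mean()

# A dict fills each column with its own value.
filled = people.fillna({'age': age_mean, 'name': 'unknown'})
print(filled)
```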

# Custom Functions

Until now, we have seen native transformation functions. But you can also define your own transformations.

## Apply to a Series

First, let's create a function.

In [None]:
def convert_age_to_category(age):
    if age < 10:
        return "Children"
    elif age < 40:
        return "Young"
    else:
        return "Even younger"

In [None]:
df = pd.read_csv('Resources/data.csv')

To apply a function to a Series, we use the `apply` method, which takes the function as its argument.

```python
df['column'].apply(function)
```

In [None]:
df['age'].apply(convert_age_to_category)

We can create a new column that stores the result:

In [None]:
df['age_category'] = df['age'].apply(convert_age_to_category)
df
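
One thing to watch for: if the `age` column contains missing values, every comparison with `NaN` is `False`, so the function above would fall through to the `else` branch for those rows. A possible variant with an explicit guard is sketched below. The name `convert_age_to_category_safe` and the sample ages are hypothetical, and returning `None` for missing ages is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

def convert_age_to_category_safe(age):
    # Hypothetical variant: same thresholds as above,
    # plus a guard that keeps missing ages missing.
    if pd.isnull(age):
        return None
    if age < 10:
        return "Children"
    elif age < 40:
        return "Young"
    else:
        return "Even younger"

# Hypothetical ages, including one missing value.
ages = pd.Series([5, 25, 60, np.nan])
categories = ages.apply(convert_age_to_category_safe)
print(categories)
```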

### Lambda Functions - Very Small Explanation

In Python, lambda functions let us define a small anonymous function in a single expression.

In [None]:
age_cat_func = lambda x: "Young" if x < 20 else "Not so young"
age_cat_func

In [None]:
age_cat_func(15)

### How to apply it to a Series?

In [None]:
df['age_category_lambda'] = df['age'].apply(lambda x: "Young" if x < 10 else "Not so young")
df

## Exercise

Create a new column "age_int" that converts the age into an integer.

1. First, write the function.
2. Then apply it to the column.

In [None]:
# TO DO

## Apply to a dataframe

We can also imagine a transformation that takes multiple columns as arguments and returns a single output.

In [None]:
df = pd.read_csv('Resources/data_concat.csv')

In [None]:
df

Let's create a function that takes 2 arguments: first and last name, and returns the full name.

In [None]:
def generate_full_name(first_name, last_name):
    return f"{first_name} {last_name}"

generate_full_name("Andy", "Barakat")

To apply the function to a dataframe, we also use the `apply` method, this time with a lambda and `axis=1` so that the function is called once per row. `x` represents the row. 
`x['column_name']` is the value of the column `column_name` in the row `x`.

In [None]:
df['full_name'] = df.apply(lambda x: generate_full_name(x['first_name'], x['last_name']), axis=1)
df
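
To see what `axis=1` changes, the sketch below checks what the lambda actually receives. The dataframe is hypothetical; only the column names `first_name` and `last_name` come from the example above.

```python
import pandas as pd

# Hypothetical data with the same column names as above.
names = pd.DataFrame({
    'first_name': ['Ada', 'Alan'],
    'last_name': ['Lovelace', 'Turing'],
})

# axis=1: the function receives one row at a time, as a Series
# indexed by column name.
row_types = names.apply(lambda x: type(x).__name__, axis=1)
print(row_types.tolist())    # ['Series', 'Series']

# Default (axis=0): the function receives one column at a time.
col_lengths = names.apply(lambda x: len(x))
print(col_lengths.tolist())  # [2, 2]
```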

**Another Approach**

Another approach is to create a function that takes the entire row as its argument and selects the columns inside the function.

In [None]:
def generate_full_name(row):
    return f"{row['first_name']} {row['last_name']}"

row = df.loc[0]
generate_full_name(row)

In [None]:
df['full_name'] = df.apply(generate_full_name, axis=1)
df

## Exercise

1. Create a function that builds a date string "YYYY-MM-DD" from a day, a month and a year.
2. Apply the function to the dataframe to create the column.

_Hint_: Try ```'%02d' % int(5)```, to add a leading zero if needed. For example:

In [None]:
integer = 8
'%02d' % int(integer)
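
The `%` operator is only one way to get the leading zero. As a side note, the sketch below shows two alternatives from the Python standard library that produce the same string; which one to use is a matter of style.

```python
integer = 8

# Three equivalent ways to pad an integer to two digits.
with_percent = '%02d' % integer
with_fstring = f"{integer:02d}"
with_zfill = str(integer).zfill(2)

print(with_percent, with_fstring, with_zfill)  # 08 08 08
```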

In [None]:
def convert_to_date(day, month, year):
    """
    Convert to "YYYY-MM-DD"
    """


# Helpful examples: parsing dictionaries

In [None]:
def get_attribute_from_dict(my_dict, attribute):
    return my_dict.get(attribute)

df = pd.read_csv('Resources/data_embed.csv', sep=';')
df

In [None]:
df['views_dict'] = df['views'].apply(json.loads)
df.dtypes

In [None]:
type(df.loc[0]['views'])

In [None]:
type(df.loc[0]['views_dict'])

In [None]:
df['number_listings'] = df['views_dict'].apply(lambda x: len(x['listings']))
df
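
The same steps can be tried without the CSV file. The sketch below uses hypothetical rows: the column name `views` and the key `listings` come from the cells above, but the actual structure of the file is assumed. It also uses `dict.get()` with a default, in the spirit of `get_attribute_from_dict`, so that a missing key does not raise a `KeyError`.

```python
import json
import pandas as pd

# Hypothetical rows; the structure of the real file is an assumption.
events = pd.DataFrame({
    'views': ['{"listings": [1, 2, 3]}', '{"listings": []}', '{}'],
})

# Parse each JSON string into a Python dict.
events['views_dict'] = events['views'].apply(json.loads)

# .get() returns the default (an empty list) when the key is missing.
events['number_listings'] = events['views_dict'].apply(
    lambda d: len(d.get('listings', []))
)
print(events['number_listings'].tolist())  # [3, 0, 0]
```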