# Manipulating and Creating Columns

> During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up to 80% of a data scientist's time.
>
> \- Wes McKinney, the creator of Pandas

## Applied Review

### Data Structures and DataFrames
- We use **DataFrames** to represent tables in Python.

- Python also supports other data structures for storing information that isn't tabular. Examples include lists and dictionaries.

- DataFrames have many **methods**, or functions that access or modify their internal data. Some examples we saw were `describe()` and `set_index()`.

- DataFrames are composed of **Series**, 1-dimensional data structures of homogeneous type.
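
As a quick refresher on the points above, here is a minimal sketch using a small made-up table (the column names and values are invented for illustration and are not the course data). It shows that selecting a single column of a DataFrame gives back a Series, and that each column has one data type.

```python
import pandas as pd

# A tiny, made-up table; names and values are illustrative only.
toy = pd.DataFrame({
    'model': ['A320', 'B737', 'E190'],
    'seats': [180, 160, 100],
})

# Selecting one column returns a Series.
seats = toy['seats']
print(type(seats).__name__)   # Series

# Each column (Series) has a single data type.
print(toy.dtypes)
```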

### Selecting and Filtering Data
- Python's pandas library supports limiting rows (via *filtering* and *slicing*), as well as *selecting* columns.

- For selecting columns, we use _just the brackets_ `df[]`. For all operations involving rows, we use the `df.loc[]` location *accessor*.

* `.loc` also supports selecting columns via the `df.loc[rows,cols]` syntax

* Note: You could even use `.loc` to _only_ select columns by writing `df.loc[:,cols]`, where `:` stands for "all rows", but in this case `df[cols]` is a better choice.
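
The selection patterns reviewed above can be sketched as follows. The `toy` DataFrame and its values are made up for illustration; only the bracket and `.loc` syntax comes from the review.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'model': ['A320', 'B737', 'E190'],
    'seats': [180, 160, 100],
    'engines': [2, 2, 2],
})

# Select columns with just the brackets.
cols_only = toy[['model', 'seats']]

# Filter rows with .loc and a boolean condition.
big_planes = toy.loc[toy['seats'] > 150]

# Filter rows and select columns at once with .loc[rows, cols].
big_models = toy.loc[toy['seats'] > 150, ['model']]
```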

## Calculations Using Columns

It's common to want to modify a column of a DataFrame, or create a new column.
To demonstrate this let's take a look at our planes data again.

In [None]:
import pandas as pd
planes = pd.read_csv('../data/planes.csv')

In [None]:
planes.head()

Suppose we wanted to know the total capacity of each plane, including the crew.
We have data on how many seats each plane has (in the `seats` column), but that only includes paying passengers.



In [None]:
seats = planes['seats']
seats.head()

For simplicity, let's say a full flight crew is always 5 people.
Series objects allow us to perform addition with the regular `+` syntax; in this case, `seats + 5`.

In [None]:
capacity = seats + 5
capacity.head()

So we've created a new Series, `capacity`, with the total carrying capacity of the plane.  

Right now this new Series is totally separate from our original `planes` DataFrame, but we can make it a column of `planes` using the **assignment syntax**, `=`, with the **column reference syntax**, `[]`.
```python
df['new_column_name'] = new_column_series
```

In [None]:
def highlight(row_or_col: pd.Series):
    labels_to_highlight = ['capacity']
    if row_or_col.name in labels_to_highlight:
        return ['background-color: lightblue']*len(row_or_col)
    else:
        return ['background-color: white']*len(row_or_col)

In [None]:
planes['capacity'] = capacity
planes.head().style.apply(highlight)

Note that `planes` now has a "capacity" column at the end.

Also note that in the code above, the *column name* goes in quotes within the bracket syntax, while the *values that will become the column* -- the Series we're using -- are on the right side of the statement.

This sequence of operations can be expressed as a single line:

In [None]:
# Create a capacity column holding the values of the seats column plus 5.
planes['capacity'] = planes['seats'] + 5

From a mathematical perspective, what we're doing here is adding a *scalar* -- a single value -- to a *vector* -- a series of values (aka a `Series`).
Other vector-scalar math is supported as well.

In [None]:
# Subtraction
planes['seats'] - 12

In [None]:
# Division
planes['seats'] / 10

In [None]:
# Exponentiation
planes['seats'] ** 2
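
One detail worth spelling out: each of these expressions returns a *new* Series and leaves the original column untouched unless the result is assigned back. The sketch below uses a made-up `toy` DataFrame (not the planes data) to show this, along with the difference between regular division `/` and floor division `//`.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({'seats': [180, 160, 100]})

# Regular division returns a new Series of floats.
halved = toy['seats'] / 2

# Floor division discards the fractional part.
floored = toy['seats'] // 7

# The original column is unchanged because nothing was assigned back.
print(toy['seats'].tolist())   # [180, 160, 100]
```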

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>

1. Create a new column, `first_class_seats`, that represents 1/5 of the available seats.  
_Hint: To avoid decimal places in the result, you can use the floor division operator `//`._

## Overwriting Columns


What if we discovered a systematic error in our data?
Perhaps we find out that the "engines" column is only the number of engines *per wing* -- so the total number of engines is actually double the value in that column.

We could create a new column, "real_engine_count" or "total_engines".
But we're not going to need the original "engines" column, and leaving it could cause confusion for others looking at our data.

A better solution would be to **replace the original column** with the new, recalculated, values.
We can do so using the **same syntax** we used for **creating a new column**.

In [None]:
planes.head()

In [None]:
# Multiply the engines column by 2, and then overwrite the original data.
planes['engines'] = planes['engines'] * 2

In [None]:
planes.head()
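
A practical caveat when overwriting a column with a value computed from itself: running the same cell a second time applies the operation again. The following sketch illustrates this on a made-up `toy` DataFrame; keeping a copy of the original values (here in the hypothetical variable `original_engines`) is one simple safeguard.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({'engines': [1, 2]})

# Keep a copy of the original values in case they are needed again.
original_engines = toy['engines'].copy()

toy['engines'] = toy['engines'] * 2   # values are now [2, 4]
toy['engines'] = toy['engines'] * 2   # run again: values are now [4, 8]
```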

## Calculating Values Based on Multiple Columns

So far we've only seen vector-scalar math.
But vector-vector math is supported as well.

Let's look at a toy example of creating a column that contains the **number of seats per engine**.

In [None]:
seats_per_engine = planes['seats'] / planes['engines']
seats_per_engine.head()

In [None]:
planes['seats_per_engine'] = seats_per_engine
planes.head()

You can combine vector-vector and vector-scalar calculations in arbitrarily complex ways.

In [None]:
planes['nonsense'] = (planes['year'] + 12) * planes['engines'] + planes['seats'] - 9
planes.head()

Note that the normal _precedence rules_ for mathematical operators hold when working with DataFrames. So we place `planes['year'] + 12` in parentheses to ensure it happens before the multiplication.
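
To see the effect of the parentheses, here is a small sketch on made-up numbers (the `toy` DataFrame is illustrative, not the planes data). Without parentheses, the multiplication is evaluated before the addition, which gives a different result.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({'year': [2000, 2010], 'engines': [2, 4]})

# Addition first, then multiplication.
with_parens = (toy['year'] + 12) * toy['engines']

# Multiplication first (12 * engines), then addition.
without_parens = toy['year'] + 12 * toy['engines']

print(with_parens.tolist())      # [4024, 8088]
print(without_parens.tolist())   # [2024, 2058]
```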

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>

1. Create a new column in the `planes` DataFrame, `technology_index`, that is calculated with the formula:  
`technology_index = (year-1900) / 4 + engines * 2`  
_Note: Remember the order of operations!_
2. Load the movies dataset and create a new column, `profit`, calculated as `gross - budget`.

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>


1. Load the Airbnb dataset (`../data/airbnb.csv`) into a DataFrame named `airbnb`.
2. Get a first impression of the data (e.g. via the `.shape` attribute and the `.head()` and `.tail()` methods). How many entries does the dataset contain? How many variables are there?
3. Filter the data so that only entries of type "Apartment" are shown. Store the result as a new DataFrame.
4. Create a new column, `price_per_person`, that gives the price per person (assuming the accommodation is fully booked).  
_Hint: The `accomodates` column tells you how many people can stay in a listing._
5. Filter the data to find all listings that have a rating higher than 90 and room for at least 4 guests. How many listings meet these criteria?
6. Create a new dataset that uses the `id` column as its index and contains only the columns `property_type`, `bedrooms`, `price`, and `rating`.
7. Bonus: Explore the dataset further according to your own interests, applying the Python and pandas skills you have learned so far.

#<div style="color: white"> 
df = pd.read_csv('../data/airbnb.csv')
df.head()
filt = df['property_type'] == "Apartment"
df[filt]
#</div>

In [None]:
#

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>

## Non-numeric Column Operations

So far we have seen mathematical operations on _numeric values_.
Of course, pandas supports string operations as well.

We can use `+` to concatenate strings, with both vectors and scalars.

In [None]:
summary = 'Tailnum is ' + planes['tailnum'] + ' and Model is ' + planes['model']
summary.head()
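
Concatenation with `+` only works when both sides are strings. If one of the columns is numeric, adding it to a string typically raises a `TypeError`, so it needs to be converted first. A minimal sketch, assuming a made-up `toy` DataFrame and using `.astype(str)` for the conversion:

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'tailnum': ['N100AA', 'N200BB'],
    'seats': [180, 100],
})

# Convert the numeric column to strings before concatenating.
label = toy['tailnum'] + ' has ' + toy['seats'].astype(str) + ' seats'
print(label.tolist())   # ['N100AA has 180 seats', 'N200BB has 100 seats']
```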

More complex string operations are possible using methods available through the `.str` *accessor*. There are _many_, so we won't cover them all.  

You can refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str) for a full overview of available methods if you are interested. The general usage pattern is always `df['col_name'].str.method_name()`:

In [None]:
# Make the manufacturer field lowercase.
lowercase_manufacturer = planes['manufacturer'].str.lower()
lowercase_manufacturer.head()

In [None]:
# Get the length of the manufacturer name
manufacturer_len = planes['manufacturer'].str.len()
manufacturer_len.head()
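
Because each `.str` method returns a new Series, calls can be chained. The sketch below uses a made-up Series of untidy manufacturer names (not the planes data) and three `.str` methods that follow the same usage pattern: `strip`, `title`, and `contains`.

```python
import pandas as pd

# Made-up, deliberately untidy values for illustration only.
manufacturer = pd.Series(['  BOEING ', 'airbus industrie', 'EMBRAER'])

# Remove surrounding whitespace, then capitalize each word.
cleaned = manufacturer.str.strip().str.title()

# Check which names contain a given substring (returns booleans).
is_boeing = cleaned.str.contains('Boeing')

print(cleaned.tolist())     # ['Boeing', 'Airbus Industrie', 'Embraer']
print(is_boeing.tolist())   # [True, False, False]
```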

## More Complex Column Manipulation

### Mapping Values

One fairly common situation in data wrangling is needing to convert one set of values to another, where there is a **one-to-one correspondence** between the _values currently in the column_ and the _new values that should replace them_.

This operation can be described as **"mapping one set of values to another"**.

Let's look at an example of this.

In [None]:
airlines = pd.read_csv('../data/airlines.csv')
# Keep just the first 5 rows for this example.
airlines = airlines.loc[0:4]
airlines

Suppose we learn that there is a mistake in the carrier codes and they should be updated.
- 9E should be PE
- B6 should be BB
- The other codes should stay as they are.

We can express this *mapping* of old values to new values using a Python dictionary.

In [None]:
value_mapping = {'9E': 'PE',
                 'B6': 'BB'}
# The format is always {old_value:new_value}
# Values which aren't in the dictionary won't be affected

Pandas provides the `.replace` method that accepts this value mapping and updates the Series accordingly.

We can use it to create a new column, "updated_carrier", with the proper carrier code values.

In [None]:
def highlight(labels_to_highlight):
    def highlight_wrapped(row_or_col: pd.Series):
        if row_or_col.name in labels_to_highlight:
            return ['background-color: lightblue']*len(row_or_col)
        else:
            return ['background-color: white']*len(row_or_col)
    return highlight_wrapped

In [None]:
value_mapping = {'9E': 'PE',
                 'B6': 'BB'}
airlines['updated_carrier'] = airlines['carrier'].replace(value_mapping)
airlines.style.apply(highlight([0,3]), axis=1)
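
The behavior for values that are missing from the dictionary is worth checking on a small example. The sketch below uses a made-up Series of carrier codes. It also contrasts `.replace` with the related `.map` method, which is not used in this lesson: with a dictionary, `.map` turns values without an entry into missing values rather than leaving them as they are.

```python
import pandas as pd

# Made-up Series for illustration only.
carrier = pd.Series(['9E', 'AA', 'AS', 'B6', 'DL'])
value_mapping = {'9E': 'PE', 'B6': 'BB'}

# .replace leaves values that are not in the dictionary untouched.
updated = carrier.replace(value_mapping)
print(updated.tolist())   # ['PE', 'AA', 'AS', 'BB', 'DL']

# .map, by contrast, yields missing values for keys it cannot find.
mapped = carrier.map(value_mapping)
print(mapped.isna().tolist())   # [False, True, True, False, True]
```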

### The `apply` Method and Beyond

If you can think of a way to express a new column as a combination of other columns and constants, it can easily be created using the methods we have seen so far.

If you need to perform some very complex or specialised operations on your data, the `apply` method allows you to execute arbitrary code _on each_ element of a Series, or on each row or column of a DataFrame. If you wish to learn more, take a look at the [`DataFrame.apply` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html). Note, however, that executing custom code at a per-element level like this incurs a performance cost compared to the vectorized operations offered by pandas.

Here is a contrived example just to illustrate that **anything** can be achieved:

In [None]:
df = pd.read_csv('../data/movies.csv', keep_default_na=False)

In [None]:
def arbitrarily_complex_operation(element):
    uppercase_element = element.upper()
    length_of_element = len(element)
    length_of_element_squared = length_of_element ** 2
    
    if len(element.split()) > 1:
        last_name = element.split()[-1]
    else:
        last_name = "Name Unknown"
    
    return f"{uppercase_element}, Name length: {length_of_element} ----- {last_name}"

In [None]:
df['director_name'].apply(arbitrarily_complex_operation)

Please DON'T do this, unless absolutely necessary! :-)
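
Many per-element operations have a vectorized counterpart. As a rough sketch on a made-up Series of names (not the movies data), the following compares `apply` with the equivalent `.str` methods for two simple tasks: uppercasing and extracting the last word.

```python
import pandas as pd

# Made-up names for illustration only.
names = pd.Series(['Ada Lovelace', 'Grace Hopper', 'Unknown'])

# Per-element custom code via apply.
upper_via_apply = names.apply(lambda name: name.upper())

# The same result using the vectorized .str accessor.
upper_via_str = names.str.upper()

# Splitting on whitespace and taking the last piece, also via .str.
last_word = names.str.split().str[-1]

print((upper_via_apply == upper_via_str).all())   # True
print(last_word.tolist())   # ['Lovelace', 'Hopper', 'Unknown']
```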

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>

1. Load the weather dataset (`../data/weather.csv`) into a DataFrame named `weather`.
2. Take a closer look at the `month` column. Using `weather.dtypes`, you can see that its values are stored as integers, not strings. How do you think these numbers relate to the months of the year?
3. Create a mapping from each number to the corresponding month name, as a dictionary (e.g. `{1: 'January', ...}`). Store it in a variable called `month_mapping`.
4. Use the `.replace` method to overwrite the current month column with the month names as strings, using your newly created mapping.

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>