# Deeper Dive on DataFrames

Now that we understand objects and functions better, let's take a closer look at DataFrames.

In [None]:
import pandas as pd
df = pd.read_csv('../data/movies.csv')
df.head(2)

## What Are DataFrames Made of?

To access an individual column of a DataFrame, pass the column name as a string in brackets.

In [None]:
director_name_column = df['director_name']
director_name_column

Individual columns are pandas `Series` objects.

In [None]:
type(director_name_column)

How are Series different from DataFrames?

- They're always 1-dimensional

- They have slightly different attributes than DataFrames
    - For example, Series have a `to_list` method -- which doesn't make sense to have on DataFrames

- They don't print in the pretty format of DataFrames, but in plain text (see above)

In [None]:
director_name_column.shape

In [None]:
df.shape

In [None]:
director_name_column.to_list()

In [None]:
# If we try the same conversion to list on the full DataFrame we get an error
df.to_list()
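
The difference can be sketched on a small made-up DataFrame (the name `small_df` and its contents are assumptions for illustration, not part of the movies data). Calling `to_list` on a DataFrame raises an `AttributeError`; one possible alternative is to convert the underlying values into a list of rows.

```python
import pandas as pd

# Hypothetical two-column DataFrame, used only for illustration
small_df = pd.DataFrame({"title": ["A", "B"], "year": [2001, 2002]})

# A single column is a Series, so to_list works
years = small_df["year"].to_list()
print(years)

# A DataFrame has no to_list method, so this raises AttributeError
try:
    small_df.to_list()
    error_raised = False
except AttributeError:
    error_raised = True
print(error_raised)

# One alternative: convert the underlying values to a list of rows
rows = small_df.values.tolist()
print(rows)
```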

It is important to be familiar with Series because they form the core of DataFrames. **Every column** of a DataFrame **is internally a Series object**.

In [None]:
# Fetch another column of the DataFrame
year_column = df["year"]
year_column

In [None]:
# Verify the type of the column
type(year_column)

Whenever you select individual columns (or rows), you'll get Series objects.

### What Can You Do with a Series?

First, let's create our own Series object from scratch -- they don't need to come from a DataFrame.

In [None]:
# Pass a list in as an argument and it will be converted to a Series.
s = pd.Series([10, 20, 30, 40, 50])
s

There are 3 things to notice about this Series:

- The *values* (10, 20, 30...)

- The *dtype*, short for data type.

- The *index* (0, 1, 2...)

#### Values
Values are fairly self-explanatory; we provided them via our input list.

#### dtype
Data types are also straightforward.

Series are always homogeneous: every value in a Series has the same type, for example only integers, only floats, or only generic Python objects (called just `object`).

Because a Python object is general enough to contain any other type, any Series holding strings or other non-numeric data will typically be of type `object`.

For example, going back to our movies DataFrame, note that the `director_name` column is of type `object`.

In [None]:
df['director_name']
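
To see how pandas picks a dtype, here is a minimal sketch with a few made-up Series (the variable names are hypothetical). It assumes nothing beyond the `pd.Series` constructor: values that share a numeric type keep it, and values without a common type fall back to `object`.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.0])
# No common numeric type, so pandas falls back to object
mixed = pd.Series([1, "two", 3.0])
# Mixing integers and floats upcasts everything to float
upcast = pd.Series([1, 2.5])

print(ints.dtype)
print(floats.dtype)
print(mixed.dtype)
print(upcast.dtype)
```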

#### Index
Indexes are more interesting.
Every Series has an index, **a way to reference each element**.
The index of a Series is a lot like the keys of a dictionary: each index element corresponds to a value in the Series, and can be used to look up that element.

In fact, if we create a Series from a _dictionary_ instead of a list that is exactly what happens. The dictionary keys will become the Index labels:

In [None]:
sample_dict = {"arno": "green", "moritz": "blue"}
pd.Series(sample_dict)

In [None]:
# But let's stay with our series of numbers for now
s = pd.Series([10, 20, 30, 40, 50])
s

In [None]:
# Our index is a range from 0 (inclusive) to 5 (exclusive).
s.index

In [None]:
s

In [None]:
# We can access a value at a specific index
s[3]

In our example, the index is just the integers 0-4, so right now it looks no different than referencing elements of a regular Python list.
*But* indexes can be changed to something different -- like the letters a-e, for example.

In [None]:
s.index = ['a', 'b', 'c', 'd', 'e']
s

Now to look up the value 40, we reference `'d'`.

In [None]:
s['d']
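
Once the index is no longer the default integers, it helps to be explicit about whether you mean a label or a position. A minimal sketch, rebuilding the same Series so the cell stands on its own: `.loc` looks up by index label, `.iloc` by 0-based position.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = s.loc["d"]     # look up by index label
by_position = s.iloc[3]   # look up by position (0-based)
print(by_label, by_position)
```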

We mentioned earlier that the rows of a DataFrame can also be obtained as a Series.
In such cases, the flexibility of Series indexes comes in handy;
the index is set to the DataFrame column names.

In [None]:
df.head(2)

In [None]:
# Note that the index corresponds to the column names
first_row = df.loc[0]
first_row

This is particularly handy because it means you can extract individual elements based on a column name.

In [None]:
first_row['director_name']
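
The same idea on a tiny stand-in DataFrame (the name `toy` and its values are made up for illustration): a row selected with `.loc` is a Series whose index holds the column names.

```python
import pandas as pd

# Hypothetical stand-in for the movies data
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "director_name": ["Director X", "Director Y"],
    "year": [1999, 2005],
})

row = toy.loc[0]
print(type(row))
print(list(row.index))        # the column names of toy
print(row["director_name"])
```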

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>

1. Find a movie that interests you and store its record (the row) in a variable, for example one named after the (abbreviated) movie title. <br> *Hint: Use the `movies.loc[0]` syntax for this. We will look at exactly how it works soon.*
2. Programmatically read the director, the two lead actors, and the length of the movie (in minutes) from the object you stored in the first step.
3. Bonus: Read out other movies or details that interest you. Store some of them in their own variables and try to output them with the `print` function. You can also experiment with f-strings of the form `f"The director of the movie {title_variable} is {director_variable}"`.


In [None]:
movies = pd.read_csv('../data/movies.csv')
movies.head(2)

#<span style="color: white">
pirates = movies.loc[1]
director = pirates['director_name']
print(director)
# f-string syntax
print(f"Director: {director}, movie title: {pirates['title']}")
#</span>

## DataFrame Indexes

It's not just Series that have indexes!
DataFrames have them too.
Take a look at the movies DataFrame again and note the bold numbers on the left.

In [None]:
df.head()

These numbers are an index, just like the one we saw on our example Series.
And DataFrame indexes support similar functionality.

In [None]:
# Our index is a range from 0 (inclusive) to 4914 (exclusive).
df.index

When you load a DataFrame without specifying an index, the **default index** runs from 0 to N-1, where N is the number of rows in your DataFrame.
This is called a `RangeIndex`.

Selecting individual rows by their index is done with the `.loc` accessor.
An *accessor* is an attribute designed specifically to help users reference contents of an object (like rows within a DataFrame).

In [None]:
# Get the row at index 4 (the fifth row).
df.loc[4]

As with Series, DataFrames support reassigning their index.

With DataFrames it often makes sense to change one of your columns into the index.

This is analogous to a _primary key_ in relational databases: an unambiguous identifier for each record that also serves as a way to quickly look up rows within a table.

In our case, maybe we will often use the movie title (`title`) to look up the details for a movie.
In that case, it would make sense to set the movie title column as our index.

In [None]:
df = df.set_index('title')
df.head()

Now the RangeIndex has been replaced with a more meaningful index, and it's possible to look up rows of the table by passing the movie title to the `.loc` accessor.

In [None]:
df.loc['The Matrix']

<font style="color:#800;">
    <strong>Caution</strong>:<br><em>Pandas does not require index values to be unique (that is, free of duplicates), although most relational databases do require that of a primary key. This means it is *possible* to create a non-unique index, but it is highly inadvisable. With duplicate values in your index, looking up a row by its label can return several rows instead of one, which leads to unexpected results. Don't do it!</em>
</font>
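
A minimal sketch of the problem, using a made-up DataFrame with a duplicated label (the name `toy` and its contents are assumptions for illustration). `Index.is_unique` lets you check for duplicates before relying on lookups.

```python
import pandas as pd

# Hypothetical data with a duplicated title in the index
toy = pd.DataFrame(
    {"year": [1933, 2005, 1999]},
    index=["King Kong", "King Kong", "The Matrix"],
)

print(toy.index.is_unique)  # False

unique_hit = toy.loc["The Matrix"]     # one matching row: a Series
duplicate_hit = toy.loc["King Kong"]   # two matching rows: a DataFrame
print(type(unique_hit).__name__)
print(type(duplicate_hit).__name__)
```

The same `.loc` call returns a different kind of object depending on the data, which is exactly the sort of surprise the caution above is about.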

### Bonus: How to find a movie if we don't know the exact name?

In [None]:
# To search for a specific movie you can use this little snippet
search_term = "Harry Potter"
[title for title in df.index if search_term in title]

In [None]:
# Another sample search
search_term = "Ring"
[title for title in df.index if search_term in title]
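
As a possible alternative to the list comprehension, pandas string methods can do the same search. This is a sketch on made-up titles (the `titles` index is a hypothetical stand-in for `df.index`), assuming none of the titles are missing.

```python
import pandas as pd

# Hypothetical titles standing in for df.index
titles = pd.Index(["Harry Potter A", "The Ring", "Some Other Movie"], name="title")

search_term = "Harry"
# regex=False treats the search term as plain text, not a regular expression
mask = titles.str.contains(search_term, regex=False)
matches = titles[mask].to_list()
print(matches)
```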

### Quick Demonstration of Control Flow in Python

There are two important constructs of control flow you should know about:

**for-loops**  
To iterate over _each_ element of a sequence/in a container

**if-else statements**  
To execute some code based on _whether_ another _condition holds true_

In Python the usage of these is intuitive since it reads almost like a sentence.

In [None]:
# Create a list
list_of_participants = ['julia', 'luca', 'ivetta', 'zouhair', 'maximilian', 'daniel']
print(list_of_participants)

In [None]:
# For-Loop
for name in list_of_participants:
    print(name)

In [None]:
# For-Loop + Conditional Statement
for name in list_of_participants:
    if 'u' in name:
        print(name.capitalize())

In [None]:
# For-Loop + Conditional Statement
for name in list_of_participants:
    if 'u' in name:
        print(name.capitalize())
    else:
        print("---")

Now let's apply the same concepts to our movie titles:

In [None]:
# Create a list
list_of_movie_titles = df.index.to_list()

In [None]:
# For-Loop + Conditional Statement
for movie_title in list_of_movie_titles:
    if 'Harry' in movie_title:
        print(movie_title)
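
The search snippet from the bonus section is the same for-loop plus condition, just written in one line as a list comprehension. A small sketch with made-up titles (the `titles` list is hypothetical) shows that both forms give the same result.

```python
# Hypothetical titles, standing in for list_of_movie_titles
titles = ["Harry Brown", "The Ring", "Dirty Harry"]

# For-loop with a condition, collecting the matches in a list
loop_matches = []
for title in titles:
    if "Harry" in title:
        loop_matches.append(title)

# The same logic written as a list comprehension
comprehension_matches = [title for title in titles if "Harry" in title]

print(loop_matches == comprehension_matches)
```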

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>

### Back to DataFrames
When starting to work with a DataFrame, it's often a good idea to determine which column makes sense as your index and to set it right at the beginning.

This will make your code nicer -- by letting you directly look up values with the index -- and also make your selections and filters faster, because Pandas is optimized for operations by index.
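
One way to set the index right at load time is the `index_col` parameter of `pd.read_csv`. The sketch below reads from an in-memory string with made-up content so that it runs on its own; in the notebook you would pass the file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; in the notebook this would be a file path
csv_text = "title,year\nMovie A,1999\nMovie B,2005\n"

# index_col makes the given column the index while loading
with_index = pd.read_csv(io.StringIO(csv_text), index_col="title")
print(with_index.index.name)
print(with_index.loc["Movie A", "year"])
```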

If you want to change the index of your DataFrame later, you can always `reset_index` (and then assign a new one).

In [None]:
df.head(2)

In [None]:
# Reset the index (to default RangeIndex)
df = df.reset_index()
df.head(2)

In [None]:
# But for now, let's keep the title column as index
df = df.set_index('title')
df.head(2)

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>

The cell below loads a dataset of American airports and stores it as `airports`.
The data contains the national airport code (Federal Aviation Administration code), the airport name, and some further information such as the latitude, longitude, and time zone of the airport.

1. What kind of index does the `airports` DataFrame currently have?
2. Is this a good choice for the DataFrame's index? If not, what column or columns would be a better candidate, and why?
3. If you chose a different column to be the index, make it your index using `airports.set_index()`.
4. Using your new index, look up "San Francisco Intl", code SFO. What is its altitude relative to sea level? Then look up "Pittsburgh-Monroeville Airport", code 4G0. What is its altitude?
5. Reset your index to the default index in case you want to make a different column your index in the future.

In [None]:
airports = pd.read_csv('../data/airports.csv')
airports.head()

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>

----

### Stretch-Goal: On Consistency and Language Design

One of the great things about Python is that its creators really cared about internal consistency.

What that means to us, as users, is that syntax is consistent and predictable -- even across uses that appear different at first.

Dot notation reveals something quite special about Python: packages are just like other objects, and the variables inside them are just attributes and methods!

This standardization across packages and objects helps us remember a single, intuitive syntax that works for many different things.

Examples:  
- The same `[]` selection syntax works for all Python objects that implement it
    - `my_list[5]`
    - `my_dictionary['Microsoft']`
    - `my_dataframe['director_name']`
    - `my_series[2]`
- Built-in functions like `type()`, `len()`, `print()` work the same across all suitable objects
- For-loops work the same way for lists, pd.Series and other collection objects. They even work for strings.
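
These points can be tried out in a short sketch. The objects below are made up for illustration; the names mirror the examples in the list.

```python
import pandas as pd

# Hypothetical example objects
my_list = [10, 20, 30]
my_dictionary = {"Microsoft": "MSFT"}
my_series = pd.Series([10, 20, 30])

# Same bracket syntax, different kinds of objects
print(my_list[2], my_dictionary["Microsoft"], my_series[2])

# len() works on all of them, and on strings too
lengths = [len(my_list), len(my_dictionary), len(my_series), len("abc")]
print(lengths)

# For-loops iterate over the values of a Series and the characters of a string
collected = []
for value in my_series:
    collected.append(value)
for char in "ab":
    collected.append(char)
print(collected)
```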
    