# Working with Python, Jupyter, and Data in Python

In this workbook, we will introduce some tools for handling data sets in a Python environment. 

As datasets often have to be preprocessed, cleaned, split, merged, or otherwise manipulated before an analytics pipeline can be started with a 'clean' dataset, these tools are needed in almost any data analytics process.

## Goals of this Section
- Gain a basic understanding of Jupyter
- Get familiar with the Python programming language
- Learn the very basics of `NumPy` and `pandas`

**This course cannot give an in-depth introduction to the Python programming language.
However, we will show you some basic concepts that should allow you to follow the lecture, create your own models based on given examples, and evaluate and improve these models further.**

## Python

Python is a programming language widely used for data manipulation. 

Unfortunately, we will not have time to explore everything Python has to offer. However, we will show you the essentials.

You can find a good Python tutorial here: https://www.w3schools.com/python/python_intro.asp

Let's start off with printing `Hello World!` below a cell.

### Hello World!

In [1]:
print('Hello World!')

Hello World!


In [2]:
# You can write comments inside code cells like this
print('Hello World!') # And you can append comments to a line like this
# Anything after the '#'-sign will be considered a comment, so the following print('Hello World') won't do anything here

Hello World!


You can always check the built-in documentation / help! Just append `?` to a variable name, function name, etc.!

In [3]:
print?

Signature: print(*args, sep=' ', end='\n', file=None, flush=False)
Docstring:
Prints the values to a stream, or to sys.stdout by default.

sep
  string inserted between values, default a space.
end
  string appended after the last value, default a newline.
file
  a file-like object (stream); defaults to the current sys.stdout.
flush
  whether to forcibly flush the stream.
Type:      builtin_function_or_method

### Variables and datatypes

Variables are created by assigning a value with a single `=` sign.
Note: You do not need to declare the datatype with Python! It is automatically inferred from the values themselves.

In [4]:
a = 101
b = 'Hello World'
c = True

As you can see above, you do not need a semicolon `;` to end a statement, you can simply continue on a new line. (You *can* use semicolons, but this is typically not done.)

To print the value of a variable, you'd normally use `print(...)`. However, Jupyter automatically displays the value of the last expression in a cell:

In [5]:
a

101

---
### &#x270d; Exercise 

Execute the following cell. 

Add a new cell (remember how?).

Define a variable `c` that equals the sum of two values `a` and `b`. Print the result. Change the value of `b` and output the result again.

In [6]:
# YOUR CODE HERE
a = 2
b = 5
c = a + b
c

7

In [7]:
b = 2
c = a + b
c

4

---

There are a few datatypes you'll frequently see. We'll show you the most important ones. Use `type(...)` to find out which datatype a value has.

In [8]:
# integer
a = 1
type(a)

int

In [9]:
# float
b = 2.5
type(b)

float

In [10]:
# string
c = 'Hello World!'
type(c)
# note: there is no "character"-datatype -- that would just be a string of length 1

str

In [11]:
# bool(ean)
g = True
type(g)

bool

### Basic operators

Various operators are available for calculations:

- `+, -, *, /` behave as you most likely expect them to
- `()` can be used to control the order of operations
- `**`  is the power operator. $2^3$ is written as `2**3`
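
The operators listed above can be tried out in a short sketch. The floor-division (`//`) and modulo (`%`) operators are not covered in the list and appear here only as an aside; all variable names are chosen for illustration.

```python
# Arithmetic operators on plain numbers
total = 7 + 3            # addition
quotient = 7 / 2         # true division always returns a float
floor_q = 7 // 2         # floor division (aside, not covered above)
remainder = 7 % 2        # modulo (aside, not covered above)

# Parentheses change the order of operations
without_parens = 2 + 3 * 4    # multiplication is evaluated first
with_parens = (2 + 3) * 4     # addition is evaluated first

# The power operator binds tighter than a leading minus sign
power = 2**3
negative_power = -2**2        # evaluated as -(2**2)

print(total, quotient, floor_q, remainder)                  # 10 3.5 3 1
print(without_parens, with_parens, power, negative_power)   # 14 20 8 -4
```

Note that `/` returns a `float` even when both operands are integers, which is why the earlier results appear as `0.0` rather than `0`.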

---
### &#x270d; Exercise 

Declare two variables: `a`, `b`.

Write a brief program that calculates the following formula: $c = a^2 - \sqrt{b}$.

Choose values for `a` and `b` such that $c=0$.

In [12]:
# YOUR CODE HERE
a = 3
b = 81
c = a**2 - b**(0.5)
c

0.0

---

### Comparing variables

You'll often use the following comparators:
- `==` checks whether two values are equal
- `!=` checks whether two values are not equal
- `>` checks whether the left-hand value is greater than the right-hand value
- `<` checks whether the left-hand value is smaller than the right-hand value
- `>=` checks whether the left-hand value is greater than or equal to the right-hand value
- `<=` checks whether the left-hand value is smaller than or equal to the right-hand value

**Note: A single `=`-sign assigns a value to a variable whereas `==` compares two values!**
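
One caveat when comparing floating-point numbers with `==`: because floats are stored with limited precision, two values that are mathematically equal may differ slightly. The sketch below illustrates this with the standard-library function `math.isclose`, which compares with a tolerance; the variable names are made up for the example.

```python
import math

# Floating-point arithmetic is not exact, so `==` can be surprising
x = 0.1 + 0.2
exact_match = (x == 0.3)             # False: x is 0.30000000000000004
close_match = math.isclose(x, 0.3)   # True: equal up to a small tolerance

# Comparisons evaluate to bool values, which can be stored like any other value
is_greater = 5 > 3
is_not_equal = 'a' != 'b'

print(exact_match, close_match, is_greater, is_not_equal)  # False True True True
```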

---
### &#x270d; Exercise 

Consider the example above. Write a brief statement that checks whether $c=0$ holds. 

In [13]:
# YOUR CODE HERE
c == 0

True

---

### Formatting code, if/else, and loops

Python also does not use braces (`{...}`) to group statements together. Instead, it uses indentation, so stray whitespace at the start of a line is an error:

In [14]:
i = 5
# Error below! Notice a single space at the start of the line
 print('Value is', i)

IndentationError: unexpected indent (1323090703.py, line 3)

We recommend using 4 spaces per indentation level.

This allows you to write statements rather cleanly, e.g.:

In [15]:
number = 10
guess = int(input('Enter an integer : '))

if guess == number:
    print('How did you do that?')
elif guess < number:
    print('Too low!')
else:
    print('Too high!')

Enter an integer :  10


How did you do that?


`for`-loops can be created like this.

In [16]:
for i in range(5):
    print(i)

0
1
2
3
4


You can also state the start (inclusive) and end (exclusive) value explicitly:

In [17]:
for i in range(0, 5):
    print(i)

0
1
2
3
4


Moreover, you can specify the increments:

In [18]:
for i in range(0, 5, 3):
    print(i)

0
3


---
### &#x270d; Exercise 

Write a statement listing every *positive, even* value in the range from 1 to 10 (inclusive).

In [19]:
for i in range(2, 11, 2):
    print(i)

2
4
6
8
10



Write a loop that computes $\sum_{i=1}^{5}2i$. Output the running sum after each iteration, together with the final result.

In [20]:
# YOUR CODE HERE

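# Initialize the running total to 0
# (named `total` rather than `sum` to avoid shadowing Python's built-in sum())
total = 0

# Loop from 1 to 5 (inclusive)
for i in range(1, 6):
    # Add 2i to the running total
    total += 2 * i
    print(total)

# Print the result
print("The sum is: ", total)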

2
6
12
20
30
The sum is:  30


Remember the exercise above. Again, please write a brief program that calculates the following formula: $c = a^2 - \sqrt{b}$.

Now, compare $c$ to `0` in an `if` statement and output the result of the comparison. 


In [21]:
# YOUR CODE HERE

# Step 1: Declare the variables
a = 2
b = 4

# Step 2: Calculate c
c = a**2 - b**(0.5)

# Step 3: Compare c to 0 and print the result
if c > 0:
    print("c is greater than 0")
elif c < 0:
    print("c is less than 0")
else:
    print("c is equal to 0")

c is greater than 0


---

### Functions
We'll be *using* functions a lot. But creating them is very simple, too. Here is how you'd do that.

In [22]:
def my_first_function():
    print('Hello World!')

In [23]:
type(my_first_function)

function

You can call the function by appending '()' to its name.

In [24]:
my_first_function()

Hello World!


In [25]:
def my_second_function(x):
    print('You wrote', x)

In [26]:
my_second_function('Hello World!')

You wrote Hello World!
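
The functions above only print something. More often, a function should hand a result back to the caller with `return`. Here is a minimal sketch; the function name `rectangle_area` and its default parameter are invented for illustration.

```python
def rectangle_area(width, height=1):
    """Return the area of a rectangle; `height` defaults to 1 if omitted."""
    return width * height

area_default = rectangle_area(3)             # height falls back to its default
area_explicit = rectangle_area(3, 4)         # both arguments given by position
area_keyword = rectangle_area(width=2, height=5)  # arguments given by name

print(area_default, area_explicit, area_keyword)  # 3 12 10
```

Unlike a printed value, a returned value can be stored in a variable and used in further calculations.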


Libraries in Python provide functions for you to use.
One example is the `math` library, which offers, among other things, a function to compute the square root of a number:

In [27]:
import math
print(math.sqrt(9))

3.0


---
### &#x270d; Exercise 

Use the `math` library to define a function for computing the value of `c` as stated above.

Make use of the `sqrt` function!

Write a function encapsulating your if-else-statement. Check your result.

In [28]:
# YOUR CODE HERE 
def compute_c(a, b):
    return a**2 - math.sqrt(b)

In [29]:
def check_c(c):
    if c > 0:
        print("c is greater than 0")
    elif c < 0:
        print("c is less than 0")
    else:
        print("c is equal to 0")

In [30]:
a = 2
b = 4

c = compute_c(a,b)
check_c(c)

c is greater than 0


---

### Lists and Dictionaries

A list is a built-in data structure that can be used to store a collection of items. Lists are mutable, which means you can add, remove, or change elements after the list is created.

Lists are defined by enclosing a comma-separated sequence of objects in square brackets `[]`. The elements of a list can be of different types: integers, floats, strings, and even other lists.

Here's an example of a list in Python:

In [31]:
# list (also called 'array' in other languages)
d = [1, 2, 3]
type(d)

list

In [32]:
my_list = [1, 2, 'apple', 4.5]

You can access elements of the list by their index. Python uses zero-based indexing, so the first element is at index 0, the second element is at index 1, and so on. For example, `my_list[2]` would return `'apple'`.
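
The following sketch shows indexing and the mutability mentioned above in action. It uses a separate list named `items` (a name chosen for this example) so that `my_list` stays untouched.

```python
items = [1, 2, 'apple', 4.5]

first = items[0]       # zero-based indexing: the first element
last = items[-1]       # negative indices count from the end
middle = items[1:3]    # slice: index 1 up to (but not including) index 3

items.append('new')    # add an element at the end
items[0] = 100         # lists are mutable: overwrite an element
items.remove(2)        # remove the first element that equals 2

print(first, last, middle)  # 1 4.5 [2, 'apple']
print(items)                # [100, 'apple', 4.5, 'new']
```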



The length of a list can be determined with the function `len`:

In [33]:
len(d)

3

In [34]:
len(my_list)

4

Dictionaries are another way of storing data.

A dictionary is a collection of key-value pairs. A dictionary is written in curly brackets `{}`. 

Each key is separated from its value by a colon `:`, and the pairs are separated by commas.

In [35]:
# dict (key-value-structure)
e = {'key1': 1, 'key2': 'a string', 'key3': 2.5}
e

{'key1': 1, 'key2': 'a string', 'key3': 2.5}

New values can be assigned as follows:

In [36]:
e['key3'] = 4
e

{'key1': 1, 'key2': 'a string', 'key3': 4}
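
Beyond overwriting an existing value, a few other dictionary operations come up frequently. The sketch below uses a made-up dictionary named `person`; the keys and values are purely illustrative.

```python
person = {'name': 'Ada', 'age': 36}

name = person['name']              # look up a value by its key
person['city'] = 'London'          # assigning to a new key adds a pair
has_age = 'age' in person          # `in` checks whether a key exists
country = person.get('country', 'unknown')  # .get returns a default instead of raising KeyError

# Iterate over all key-value pairs
for key, value in person.items():
    print(key, '->', value)
```

Accessing a missing key with square brackets (e.g. `person['country']`) raises a `KeyError`, which is why `.get` with a default is often the safer choice.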

---
### &#x270d; Exercise 

Below, we have declared two lists. 

Create a new dictionary and store pairs of keys and values, i.e., fruits as keys and the respective number as value.

In [37]:
keys = ['apple', 'banana', 'cherry']
values = [1, 2, 3]

# YOUR CODE HERE

# Initialize an empty dictionary
fruit_dict = {}

# Merge the lists into a dictionary using a loop
# NOTE: This can be written more succinctly. The long example is used to illustrate the behavior of lists and dictionaries.
for index in range(0, len(keys)):
    key = keys[index]
    value = values[index]
    fruit_dict[key] = value

print(fruit_dict)  # Outputs: {'apple': 1, 'banana': 2, 'cherry': 3}

# More elegantly:
fruit_dict = dict(zip(keys, values))
# Or:
fruit_dict = {k : v for k, v in zip(keys, values)}
print(fruit_dict)

{'apple': 1, 'banana': 2, 'cherry': 3}
{'apple': 1, 'banana': 2, 'cherry': 3}


---

## Working with Data in Python

While "simple" Python is quite powerful (i.e., you can do about everything with the language already), we use dedicated libraries to work with data. 

These libraries offer functionalities that make data operations (a lot) easier and more efficient. 

First, we *import* the libraries that provide functionalities for handling and manipulating data sets:

In [38]:
import numpy as np
import pandas as pd

Note: The two libraries sometimes seem similar, but offer somewhat different functionalities. `pandas` works on labeled data structures called *Series* (one-dimensional) and *DataFrames* (tabular, two-dimensional). `NumPy` is for working on multidimensional arrays. Depending on your data and on the tools you use, you will likely encounter both libraries.

## Basics of `NumPy`

`NumPy` provides functionality to work with n-dimensional data structures called "arrays".

Why would we need that? We have arrays (lists) in Python, don't we? Let's see.

In [39]:
vec1 = [1, 2, 3]
vec2 = [4, 5, 6]

In [40]:
# Now, let's add them:
vec1 + vec2

[1, 2, 3, 4, 5, 6]

As you can see, Python "added" them. But probably not in the way we intended! If you want *mathematical* operations on arrays (vectors), you need another library if you don't want to do it manually!
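
To make the difference concrete, here is a small sketch of what "doing it manually" could look like with plain lists, next to the NumPy equivalent. The names `manual_sum` and `numpy_sum` are chosen for this example.

```python
import numpy as np

vec1 = [1, 2, 3]
vec2 = [4, 5, 6]

# "Manual" element-wise addition with plain lists:
# zip pairs up the elements, the list comprehension adds each pair
manual_sum = [a + b for a, b in zip(vec1, vec2)]
print(manual_sum)   # [5, 7, 9]

# The same operation with NumPy arrays is a single `+`
numpy_sum = np.array(vec1) + np.array(vec2)
print(numpy_sum)    # [5 7 9]
```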

### Array creation

In this section, we focus on some tools for array manipulation. 

First, we create a one-dimensional array, i.e., a vector. 

In [41]:
x = np.array([3, 4, 5])
y = np.array([4, 9, 7])

You can do simple mathematical operations on these arrays just as you would do with "primitive" integer variables: 

In [42]:
x + y

array([ 7, 13, 12])

Higher-dimensional arrays are created from nested lists, for example a two-dimensional array (a matrix):

In [43]:
x = np.array([[1, 2, 5], [3, 4, 6]])
x

array([[1, 2, 5],
       [3, 4, 6]])

### Array indexing and slicing

The `x` data array above is a $2 \times 3$ array (2 rows, 3 columns), which we can verify using its `shape` attribute: 

In [44]:
x.shape

(2, 3)

You can access elements of this array as follows. Note that `NumPy` starts indexing at 0 (as do most programming languages).

As you will see, you can access rows, columns, and single values:

In [45]:
x[0]

array([1, 2, 5])

---
### &#x270d; Exercise 

Execute the two following cells.

How does the indexing work?

In [46]:
x[1,1]

np.int64(4)

In [47]:
x[:,0]

array([1, 3])

Look at the results above and try to make sense of the commands and the results!
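
As a brief summary of the pattern: for a two-dimensional array, the index is written as `[row, column]`, and a colon `:` stands for "everything along this axis". The sketch below re-creates the array from above; the result names are chosen for illustration.

```python
import numpy as np

x = np.array([[1, 2, 5], [3, 4, 6]])

single = x[1, 1]       # row 1, column 1 -> a single value
first_col = x[:, 0]    # all rows, column 0 -> the first column
second_row = x[1, :]   # row 1, all columns -> the second row
block = x[:, 1:3]      # all rows, columns 1 and 2 -> a 2x2 sub-array

print(single)        # 4
print(first_col)     # [1 3]
print(second_row)    # [3 4 6]
print(block)
```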

-------

### Basic array operations

`NumPy` offers various other functions that can be applied to the data objects. Try the following!

Note: This would of course not work with "standard" Python lists. That is why we use `NumPy`!

In [48]:
x.mean()

np.float64(3.5)

In [49]:
x.var()

np.float64(2.9166666666666665)

In [50]:
x[1].mean()

np.float64(4.333333333333333)

In [51]:
x.min()

np.int64(1)

In [52]:
# calculate the minimum of each column ("axis=0")
x.min(axis=0)

array([1, 2, 5])

In [53]:
# calculate the maximum of each row ("axis=1")
x.max(axis=1)

array([5, 6])

In [54]:
# element-wise power (here: squaring every entry)
x**2

array([[ 1,  4, 25],
       [ 9, 16, 36]])

In [55]:
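# sort() sorts in place along the last axis, i.e. within each row;
# the rows here are already sorted, so x is unchanged
x.sort()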
x

array([[1, 2, 5],
       [3, 4, 6]])

In [56]:
a = np.array([11, 11, 12, 13, 14, 15, 16, 17, 12, 13, 11, 14, 18, 19, 20])
unique_values = np.unique(a)
unique_values

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

In [57]:
unique_values, occurrence_count = np.unique(a, return_counts=True)
occurrence_count

array([3, 2, 2, 2, 1, 1, 1, 1, 1, 1])

In [58]:
unique_values

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

In [59]:
# check for each element whether the statement is True or False
a < 15

array([ True,  True,  True,  True,  True, False, False, False,  True,
        True,  True,  True, False, False, False])

In [60]:
# use the result of the element-wise check to filter the array and print only elements for which the statement is True
a[a < 15]

array([11, 11, 12, 13, 14, 12, 13, 11, 14])
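
Several conditions can be combined into one filter. With NumPy arrays this is done with `&` (and) and `|` (or) rather than the keywords `and` / `or`, and each condition needs its own parentheses. A sketch using the same array `a` (the result names are made up for the example):

```python
import numpy as np

a = np.array([11, 11, 12, 13, 14, 15, 16, 17, 12, 13, 11, 14, 18, 19, 20])

# Keep values between 13 and 16 (inclusive)
between = a[(a >= 13) & (a <= 16)]

# Keep values below 12 or above 18
extremes = a[(a < 12) | (a > 18)]

# A boolean mask can also be counted, since True counts as 1
n_small = (a < 15).sum()

print(between)    # [13 14 15 16 13 14]
print(extremes)   # [11 11 11 19 20]
print(n_small)    # 9
```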

---
### &#x270d; Exercise 

Create a NumPy array named `arr` from the list `[1, 2, 3, 4, 5]`.

Print the first element in the array.
Print the last element in the array.
Print the sum of all the elements in the array.
Print the mean of the array.

In [61]:
# YOUR CODE HERE
arr = np.array([1, 2, 3, 4, 5])
print(arr[0])  # First element
print(arr[-1])  # Last element
print(arr.sum())  # Sum of all elements
print(arr.mean())  # Mean of the array

1
5
15
3.0


---

Of course, there are many more operations, but this will give you an understanding of what is possible and what happens in the examples we show you in the following sessions.

Many "tedious" tasks can be simplified by `NumPy` functions. Just one example: Have a look at `arange`!

In [62]:
np.arange?

Docstring:
arange([start,] stop[, step,], dtype=None, *, device=None, like=None)

Return evenly spaced values within a given interval.

``arange`` can be called with a varying number of positional arguments:

* ``arange(stop)``: Values are generated within the half-open interval
  ``[0, stop)`` (in other words, the interval including `start` but
  excluding `stop`).
* ``arange(start, stop)``: Values are generated within the half-open
  interval ``[start, stop)``.
* ``arange(start, stop, step)`` Values are generated within the half-open
  interval ``[start, stop)``, with spacing between values given by
  ``step``.

For integer arguments the function is roughly equivalent to the Python
built-in :py:class:`range`, but returns an ndarray rather than a ``range``
instance.

When using a non-integer step, such as 0.1, it is often better to use
`numpy.linspace`.


Parameters
----------
start : integer or real, optional
    Start of interval.  The interval includes this value.  The d

In [63]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [64]:
np.arange(10,20)

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [65]:
np.arange(100,200,5)

array([100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160,
       165, 170, 175, 180, 185, 190, 195])

## Basics of `pandas`

`pandas` is built "on top of" `NumPy`: it adds labeled data structures and convenient tools for manipulating large data sets to NumPy's fast array operations. Let's first look at `Series`.

### pandas Series

A Series is a one-dimensional array of entries that can be created as follows:

In [66]:
x = pd.Series([1, 2, 3, 4])
x

0    1
1    2
2    3
3    4
dtype: int64

Compare this with a one-dimensional NumPy array below. What do you notice?

In [67]:
np.array([1, 2, 3, 4])

array([1, 2, 3, 4])

The comparison shows that pandas Series are *always indexed*.

Indices can also be set explicitly:

In [68]:
data = [10,11,12,13,14,15,16,17,18,19,20,21,22]
x = pd.Series(data,index=['a','b','c','k','d','f','g','h','i','i','e','k','s'])
x

a    10
b    11
c    12
k    13
d    14
f    15
g    16
h    17
i    18
i    19
e    20
k    21
s    22
dtype: int64

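As you can see, indices can also be strings (or any other consistent data type), and labels do not have to be unique (`i` and `k` each appear twice here). Think of such a Series as labeled data. 

These indices can also be used to access data explicitly. Unlike positional slicing, a label-based slice includes its end label: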

In [69]:
x['b':'d']

b    11
c    12
k    13
d    14
dtype: int64
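
To make the distinction between labels and positions explicit, pandas offers the accessors `.loc` (label-based) and `.iloc` (position-based). A minimal sketch with a small Series named `letters` (the name and data are chosen for this example):

```python
import pandas as pd

letters = pd.Series([10, 11, 12, 13], index=['a', 'b', 'c', 'd'])

by_label = letters.loc['b':'c']    # label-based slice: both end points are included
by_position = letters.iloc[1:2]    # position-based slice: the end is excluded
single = letters.loc['d']          # a single label returns a single value

print(by_label.tolist())      # [11, 12]
print(by_position.tolist())   # [11]
print(single)                 # 13
```

Using `.loc` and `.iloc` avoids ambiguity about whether a plain `[...]` refers to labels or positions.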

Why is this useful? See what happens when the two following Series objects are added. What does `fill_value` do?

In [70]:
x = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
y = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
x.add(y, fill_value = 0)

a     6.0
b     8.0
c     3.0
d    11.0
e     9.0
dtype: float64
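
For comparison, the sketch below adds the same two Series with the plain `+` operator. pandas aligns the entries by label; where a label exists in only one Series, the result is `NaN` (a missing value) unless a `fill_value` is given. The names `plain` and `filled` are chosen for illustration.

```python
import pandas as pd

x = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
y = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])

plain = x + y                      # 'c' and 'e' exist in only one Series -> NaN
filled = x.add(y, fill_value=0)    # the missing partner is treated as 0

print(plain)
print(filled)
```

Note that values are matched by label, not by position: `d` is computed as `7 + 4`, even though the two values sit at different positions in `x` and `y`.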

`pandas` is a powerful tool for working with time series, too. Have a look at the following (more sophisticated) example.

In [71]:
# let's create an index for a time range
index = pd.date_range('2023-10-01', '2023-12-15')
index

DatetimeIndex(['2023-10-01', '2023-10-02', '2023-10-03', '2023-10-04',
               '2023-10-05', '2023-10-06', '2023-10-07', '2023-10-08',
               '2023-10-09', '2023-10-10', '2023-10-11', '2023-10-12',
               '2023-10-13', '2023-10-14', '2023-10-15', '2023-10-16',
               '2023-10-17', '2023-10-18', '2023-10-19', '2023-10-20',
               '2023-10-21', '2023-10-22', '2023-10-23', '2023-10-24',
               '2023-10-25', '2023-10-26', '2023-10-27', '2023-10-28',
               '2023-10-29', '2023-10-30', '2023-10-31', '2023-11-01',
               '2023-11-02', '2023-11-03', '2023-11-04', '2023-11-05',
               '2023-11-06', '2023-11-07', '2023-11-08', '2023-11-09',
               '2023-11-10', '2023-11-11', '2023-11-12', '2023-11-13',
               '2023-11-14', '2023-11-15', '2023-11-16', '2023-11-17',
               '2023-11-18', '2023-11-19', '2023-11-20', '2023-11-21',
               '2023-11-22', '2023-11-23', '2023-11-24', '2023-11-25',
      

In [72]:
# then we create some random integers
# our data array contains the same number of elements as our index
data = np.random.randint(0, 100, len(index))
data

array([72, 80, 89, 97, 48,  7, 66, 23, 25, 54, 32, 24, 70, 14, 99, 21, 98,
       77, 23, 52, 51, 10, 30, 54, 94, 97, 95, 80, 34, 48, 44, 84,  0, 31,
       18, 50, 91, 14, 20, 81, 17, 51, 73,  4, 86, 56, 90, 16,  7, 47, 99,
       97, 16, 59, 82, 32, 97,  9, 66, 60, 83, 57, 64, 27, 72, 92, 91,  4,
       84, 68, 40, 48, 46, 45, 37,  9], dtype=int32)

In [73]:
# now, we 'merge' both to a time series
series = pd.Series(data=data, index=index)
series

2023-10-01    72
2023-10-02    80
2023-10-03    89
2023-10-04    97
2023-10-05    48
              ..
2023-12-11    48
2023-12-12    46
2023-12-13    45
2023-12-14    37
2023-12-15     9
Freq: D, Length: 76, dtype: int32

In [74]:
# pandas has some nice features to work with data
series.describe()

count    76.000000
mean     53.000000
std      30.414909
min       0.000000
25%      24.750000
50%      51.500000
75%      81.250000
max      99.000000
dtype: float64

In [75]:
series > 10

2023-10-01     True
2023-10-02     True
2023-10-03     True
2023-10-04     True
2023-10-05     True
              ...  
2023-12-11     True
2023-12-12     True
2023-12-13     True
2023-12-14     True
2023-12-15    False
Freq: D, Length: 76, dtype: bool

In [76]:
series.sum()

np.int64(4028)

In [77]:
# another very powerful feature is slicing (i.e. "selecting") data
# here we do it based on the date
series['2023-10-04':'2023-11-14']

2023-10-04    97
2023-10-05    48
2023-10-06     7
2023-10-07    66
2023-10-08    23
2023-10-09    25
2023-10-10    54
2023-10-11    32
2023-10-12    24
2023-10-13    70
2023-10-14    14
2023-10-15    99
2023-10-16    21
2023-10-17    98
2023-10-18    77
2023-10-19    23
2023-10-20    52
2023-10-21    51
2023-10-22    10
2023-10-23    30
2023-10-24    54
2023-10-25    94
2023-10-26    97
2023-10-27    95
2023-10-28    80
2023-10-29    34
2023-10-30    48
2023-10-31    44
2023-11-01    84
2023-11-02     0
2023-11-03    31
2023-11-04    18
2023-11-05    50
2023-11-06    91
2023-11-07    14
2023-11-08    20
2023-11-09    81
2023-11-10    17
2023-11-11    51
2023-11-12    73
2023-11-13     4
2023-11-14    86
Freq: D, dtype: int32
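
Two further time-series conveniences, shown as a hedged sketch: selecting a whole month with a partial date string, and aggregating per month with `resample`. To keep the numbers predictable, the example uses the values 0, 1, 2, ... instead of random integers; the names `days`, `counts`, `november`, and `monthly` are made up for this example.

```python
import numpy as np
import pandas as pd

days = pd.date_range('2023-10-01', '2023-12-15')
counts = pd.Series(np.arange(len(days)), index=days)   # 0, 1, 2, ... (not random)

# Partial-string indexing: '2023-11' selects every day in November 2023
november = counts.loc['2023-11']

# Group the daily values by calendar month ('MS' = month start) and sum them
monthly = counts.resample('MS').sum()

print(len(november))   # 30
print(monthly)
```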

---
### &#x270d; Exercise 
Create a pandas Series named `s` from the list `[1, 2, 3, 4, 5]`.
Print the elements in the Series that are greater than 2. This works similarly to filtering a NumPy array.


In [78]:
# YOUR CODE HERE 
s = pd.Series([1, 2, 3, 4, 5])
print(s[s > 2])  # Outputs: 3, 4, 5

2    3
3    4
4    5
dtype: int64


---

### pandas DataFrames

DataFrames are two-dimensional, table-like structures in which each column is a Series. We can create a DataFrame from several Series objects with consistent indices.

As we have seen above: A dict(ionary) is a Python data structure which maps several keys to their counterpart values.

In [79]:
# we create one Series for the population of some states
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [80]:
# then we create a series for the area of some states
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [81]:
# finally, we "merge" them to one data frame (which you can look at as a table)
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


Simply put: A DataFrame is somewhat like a table which you can manipulate in Python using different libraries!

---
### &#x270d; Exercise 

Consider the following data frame:

In [82]:
df = pd.DataFrame({ 'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9] })
df

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


Print the sum of all the elements in column 'A'.

Print the mean of the elements in column 'B'.

In [83]:
# YOUR CODE HERE
print(df['A'].sum())  # Sum of all elements in column 'A'
print(df['B'].mean())  # Mean of the elements in column 'B'

6
5.0


---

### Pandas data operations

Working with Pandas objects is often quite intuitive.

You can select single columns from our data frame!

In [84]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

pandas comes with a lot of tools. For example, if you want to sort by the values of one column, you can use:

In [85]:
states.sort_values(by='population')

            population    area
Illinois      12882135  149995
Florida       19552860  170312
New York      19651127  141297
Texas         26448193  695662
California    38332521  423967


You can do calculations with data frames!

To compute the population density, for example, you have to divide `population` by `area`. How would you translate this into Python code?

In [86]:
states['density'] = states['population']/states['area']
states

            population    area     density
California    38332521  423967   90.413926
Texas         26448193  695662   38.018740
New York      19651127  141297  139.076746
Florida       19552860  170312  114.806121
Illinois      12882135  149995   85.883763
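
Just as with NumPy arrays and Series, a condition on a column can be used to filter the rows of a DataFrame. The sketch below rebuilds the table under the separate name `states_demo` (chosen so the `states` object above is not overwritten) and keeps only the densely populated states; the threshold of 100 is arbitrary.

```python
import pandas as pd

states_demo = pd.DataFrame(
    {
        'population': [38332521, 26448193, 19651127, 19552860, 12882135],
        'area': [423967, 695662, 141297, 170312, 149995],
    },
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'],
)
states_demo['density'] = states_demo['population'] / states_demo['area']

# Keep only the rows where the condition is True
dense = states_demo[states_demo['density'] > 100]

# Access a single cell by row label and column name
texas_area = states_demo.loc['Texas', 'area']

print(dense.index.tolist())   # ['New York', 'Florida']
print(texas_area)             # 695662
```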


You can extract a column as a *Series* by using its name in square brackets:

In [87]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

Furthermore, you can extract one or more columns as a *DataFrame* by using their names in double square brackets:

In [88]:
states[['population']]

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


Here, we want to extract *two* columns:

In [89]:
states[['population', 'density']]

            population     density
California    38332521   90.413926
Texas         26448193   38.018740
New York      19651127  139.076746
Florida       19552860  114.806121
Illinois      12882135   85.883763


Again: This was just a *very* brief overview of the most basic functions of `NumPy` and `pandas`. Check out their documentation if you want to leverage their full potential (https://pandas.pydata.org/docs/, https://numpy.org/doc/).

Keep in mind that they are very powerful tools. Read the documentation and use the built-in help (`?`) to understand them!

In the next sections, we'll apply our knowledge to an example dataset and get to know more tools from our toolboxes.