# Review of Day 1

## Fundamentals

### Data Types

Everything in Python is an object, and every object has a type.

You can check the type of any Python object via `type(obj)`.

Let's review the most important ones.

**Integers** – Whole Numbers

In [None]:
i = 3
i

In [None]:
type(i)

**Floats** – Decimal Numbers

In [None]:
f = 3.4
f

In [None]:
type(f)

**Strings** – Bits of Text

In [None]:
s = 'python'
s

In [None]:
type(s)

**Lists** – Ordered collections of other Python objects

In [None]:
l = ['a', 'b', 'c']
l

In [None]:
type(l)

**Dictionaries** – A collection of key-value pairs, which let you easily look up the value for a given key

In [None]:
d = {'a': 1,
     'b': 2,
     'z': 26}
d

In [None]:
type(d)

**DataFrames** - Tabular datasets. Part of the Pandas library.

Here we create a new DataFrame from a list of tuples, and we explicitly tell Pandas the column names.

In [None]:
import pandas as pd
list_with_data = [ ('Arno', 3, 'green'), ('Jannick', 42, 'blue') ]
cols = ['Name', 'Number', 'Color']
df = pd.DataFrame(list_with_data, columns=cols)
df

In [None]:
type(df)

Alternatively, we can create a DataFrame from a list of dictionaries, one dictionary per row. In this case the dictionary _keys_ are used as the column names.

In [None]:
import pandas as pd
data_as_dict = [
    {'Name':'Arno', 'Number': 3, 'Color': 'green'},
    {'Name':'Jannick', 'Number': 42, 'Color': 'blue'}
    ]
df = pd.DataFrame(data_as_dict)
df
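
A single dictionary works as well, as long as each key maps to a list holding the values of one column. The following is a minimal sketch of that variant; the names `data_as_columns` and `df_from_columns` are made up for this illustration.

```python
import pandas as pd

# One dictionary: each key is a column name, each value is that column's data
data_as_columns = {
    'Name': ['Arno', 'Jannick'],
    'Number': [3, 42],
    'Color': ['green', 'blue'],
}
df_from_columns = pd.DataFrame(data_as_columns)
print(df_from_columns.columns.tolist())  # ['Name', 'Number', 'Color']
print(df_from_columns.shape)             # (2, 3)
```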

**Series** - 1-dimensional data structure of homogeneous contents. DataFrames are composed of Series.

In [None]:
series = df['Name']
series

In [None]:
type(series)

### The `type` Function

You can use the `type` function to determine the type of any object.

In [None]:
x = [1, 2, 3]
type(x)

In [None]:
x = 'hello'
type(x)

## Bonus

There are two more common data types in Python that I quickly want to show you:

**Tuples** - Similar to a list but _immutable_, that is to say, once a tuple has been created its contents can't be changed. You can create tuples using normal parentheses `()`:

In [None]:
my_tuple = ("Arno", 42, "Schwäbisch Hall")
my_tuple

In [None]:
type(my_tuple)

In [None]:
my_tuple[2]
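
To see what _immutable_ means in practice, here is a small sketch (the name `my_list` is only for illustration): replacing an element of a list works, while the same assignment on a tuple raises a `TypeError`.

```python
# A list can be changed in place ...
my_list = ["Arno", 42, "Schwäbisch Hall"]
my_list[1] = 43

# ... but a tuple cannot: item assignment raises a TypeError
my_tuple = ("Arno", 42, "Schwäbisch Hall")
try:
    my_tuple[1] = 43
except TypeError as error:
    print(f"TypeError: {error}")
```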

**Sets** - Unordered collections of unique objects. Sets are created with curly braces `{}`, just like dictionaries, but instead of key-value pairs they contain only individual elements. (An empty pair of braces `{}` creates a dictionary; use `set()` for an empty set.)

In [None]:
my_set = {"Apple", "Banana", "Strawberry", "Apple", "Apple", "Kiwi"}
my_set

In [None]:
type(my_set)

In [None]:
"Apple" in my_set

In [None]:
"Apricot" in my_set

We won't be using **tuples** or **sets** much in the Data Science context. But it is good to be aware that they exist.

We could use them for various "tricks". For example, since sets don't accept duplicates and all Python container objects support the `len()` function, we could do the following to find out how many distinct movie directors there are in our movies dataset:

In [None]:
movies = pd.read_csv('../data/movies.csv')
directors = movies['director_name']
directors.head()

In [None]:
len(directors)

In [None]:
len(set(directors))

We see that there are 2395 _unique_ director names in the dataset. But we wouldn't normally use this approach. Pandas has specialized operations for all common data analysis tasks.

`directors.drop_duplicates().count()` is how it could be done, if you are curious. :-)
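
As a rough sketch of how the Pandas variants behave, the example below uses a small made-up Series in place of `movies['director_name']` (the names are invented for illustration). `nunique()` is a shorter way to get the same count. Both Pandas variants skip missing values, whereas the `set` approach counts a missing value as an element, so the results can differ when data is missing.

```python
import pandas as pd

# Hypothetical director names, standing in for movies['director_name']
directors = pd.Series(["Nolan", "Spielberg", "Nolan", None, "Tarantino"])

# Both Pandas variants count distinct, non-missing values
print(directors.drop_duplicates().count())  # 3
print(directors.nunique())                  # 3

# The set-based trick also counts the missing value
print(len(set(directors)))                  # 4
```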

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>

## Packages, Modules, and Functions

### Packages

*Packages* (generally synonymous with *modules* or *libraries*) are extensions for Python featuring useful code.

The **DataFrame** type comes in the **Pandas package**.

### Functions

*Functions* are executable Python code stored in a name, just like a regular variable.

In [None]:
def my_function(name, city):
    print(f"Hello {name}. Welcome to {city}.")

In [None]:
type(my_function)

You can call a function by putting parentheses after its name and, if the function accepts them, passing *arguments* inside the parentheses: `my_function(argument_1, argument_2)`

In [None]:
my_function("Ferdinand", "Schwäbisch Hall")

### Attributes and Methods

Python objects (that's everything in Python) come with *attributes*, or internal information accessible through dot syntax:
```python
myobject.attribute
```

Attributes can be handy when you want to learn more about an object.

In [None]:
df.shape

**Some attributes** actually **contain functions**, in which case we call them *methods*.

In [None]:
df.set_index('Name')
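
The difference between a plain attribute and a method can be sketched as follows. The small DataFrame is rebuilt here so that the example runs on its own; `indexed` is a name chosen for this illustration.

```python
import pandas as pd

# A tiny DataFrame, mirroring the one created earlier
df = pd.DataFrame({'Name': ['Arno', 'Jannick'], 'Number': [3, 42]})

# A plain attribute is just looked up: no parentheses
print(df.shape)                # (2, 2)
print(callable(df.shape))      # False

# A method is an attribute that holds a function, so it can be called
print(callable(df.set_index))  # True
indexed = df.set_index('Name')
print(indexed.index.tolist())  # ['Arno', 'Jannick']
```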

### DataFrames and Series

When you extract individual rows or columns of DataFrames, you get a 1-dimensional dataset called a *Series*.

Series look like lists but their data must be all of the same type.
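
A quick way to see the "same type" rule is to look at the `dtype` of a Series. In this sketch (the example data is made up) a numeric column has an integer dtype, while a row that mixes text and numbers falls back to the general-purpose `object` dtype.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Arno', 'Jannick'], 'Number': [3, 42]})

# A column is a Series with a single dtype
numbers = df['Number']
print(type(numbers).__name__)  # Series
print(numbers.dtype)

# A row is a Series too; mixing text and numbers gives dtype object
row = df.loc[0]
print(type(row).__name__)      # Series
print(row.dtype)               # object
```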

## Importing Data

```python
import pandas as pd
data = pd.read_csv('myfile.csv')
```
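
To try `read_csv` without a file on disk, the sketch below feeds it an in-memory text buffer. The CSV contents are invented for this example; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical contents)
csv_text = io.StringIO("Name,Number,Color\nArno,3,green\nJannick,42,blue\n")

# The first line of the CSV becomes the column names
data = pd.read_csv(csv_text)
print(data.shape)             # (2, 3)
print(data.columns.tolist())  # ['Name', 'Number', 'Color']
```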

## Subsetting and Filtering

There are two primary ways of subsetting data:

- **Selecting** - Including certain *columns* of the data while excluding others

- **Filtering** - Including only certain *rows* with data that meet some criterion

### Selecting

Selection is done with _just the brackets_ `[]`.  <br>
Pass a single column name (as a string) or a list of column names.

```python
# The column "my_column", returned as a Series
df['my_column']

# The columns "column_1" and "column_2" returned as a DataFrame 
df[['column_1', 'column_2']]
```

If you pass a list, the returned value will be a DataFrame.
If you pass a single column name, it will be a Series.
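
The rule can be checked directly on a small made-up DataFrame. Note that a list with only one column name still returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'column_1': [1, 2],
    'column_2': ['a', 'b'],
    'column_3': [0.1, 0.2],
})

single = df['column_1']                  # a string: Series
one_col_frame = df[['column_1']]         # a list with one name: DataFrame
several = df[['column_1', 'column_2']]   # a list with two names: DataFrame

print(type(single).__name__)         # Series
print(type(one_col_frame).__name__)  # DataFrame
print(type(several).__name__)        # DataFrame
```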

### Filtering

Accessing a subset of rows is done with the `.loc` accessor and brackets. <br>
Pass in a row index, a range of row indices, or a list of row indices.

```python
# The fifth (zero-indexing!) row, returned as a Series
df.loc[4]

# The second, third, and fourth rows, returned as a DataFrame
df.loc[1:3]

# The third, fifth, and seventh rows, returned as a DataFrame
df.loc[[2,4,6]]
```
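
The following sketch runs the three `.loc` patterns on a made-up DataFrame with the default index (0, 1, 2, ...). One detail worth noticing: unlike slicing a Python list, a `.loc` slice includes its end label.

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40, 50, 60, 70]})

# A single label returns that row as a Series
print(df.loc[4]['value'])                   # 50

# A slice of labels includes BOTH ends
print(df.loc[1:3]['value'].tolist())        # [20, 30, 40]

# A list of labels returns exactly those rows
print(df.loc[[2, 4, 6]]['value'].tolist())  # [30, 50, 70]

# For comparison: a list slice excludes the end position
print([10, 20, 30, 40, 50][1:3])            # [20, 30]
```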

DataFrames can be **filtered** by passing a *condition* in brackets.

```python
# Keep rows where `condition` is true
df[condition]
```

Conditions are things like tests of equality, assertions that one value is greater than another, etc.

```python
# Keep rows where the value in "my_column" is equal to 5
filt = df['my_column'] == 5
df[filt]
```

```python
# Combining conditions
# Keep rows where my_column is less than 3 OR greater than 10
filt1 = df['my_column'] < 3
filt2 = df['my_column'] > 10
filt_combined = filt1 | filt2
df[filt_combined]
```
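
Here is a runnable sketch of combined conditions on made-up data. Besides `|` for OR, it shows `&` for AND. When conditions are written inline rather than stored in variables, each one needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({'my_column': [1, 5, 12, 7, 2]})

# OR: less than 3, or greater than 10
low = df['my_column'] < 3
high = df['my_column'] > 10
print(df[low | high]['my_column'].tolist())  # [1, 12, 2]

# AND: between 3 and 10, with each inline condition in parentheses
middle = df[(df['my_column'] >= 3) & (df['my_column'] <= 10)]
print(middle['my_column'].tolist())          # [5, 7]
```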

### Selecting and Filtering Together

Using `.loc`, it's possible to do selecting and filtering all in one step.

```python
# Filter down to rows where column_a is equal to 5,
# and select column_b and column_c from those rows
filt = df['column_a'] == 5
cols = ['column_b', 'column_c']
df.loc[filt, cols]
```
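
The same pattern, run on a small made-up DataFrame so the result can be inspected. The name `subset` is chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'column_a': [5, 1, 5],
    'column_b': ['x', 'y', 'z'],
    'column_c': [0.5, 1.5, 2.5],
})

filt = df['column_a'] == 5
cols = ['column_b', 'column_c']

# Rows where column_a is 5, restricted to column_b and column_c
subset = df.loc[filt, cols]
print(subset.index.tolist())        # [0, 2]
print(subset.columns.tolist())      # ['column_b', 'column_c']
print(subset['column_b'].tolist())  # ['x', 'z']
```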

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>