# 1. Introduction to Numpy

<br>
<center>
<img src="https://files.realpython.com/media/Correlation-in-NumPy-and-Correlation-in-Pandas_Watermarked.69b8c063482f.jpg">
</center>



## 1.1. What is Numpy?
--- 

- Numpy is a Python **package** that provides:
    - Mathematical tools, such as calculating the **mean**, **maximum**, etc.
    - **Vector operations** (linear algebra)
    - Operations over an entire **collection of values** at once (e.g. `lists`)





## 1.2. Installing and Importing Numpy
---

- First and foremost, we need to **install** the **package**
- To do this, run the following in your notebook

In [158]:
!pip install numpy



In [45]:
import numpy as np


## 1.3. Motivating Numpy
---
- Let's say we want to calculate the body mass index (BMI) for several people

$$BMI = \frac{weight}{height^2}$$


In [1]:
height = [1.50, 1.60, 1.70]
weight = [60, 70, 80]

---
- What if we could combine the `lists` **element by element**, so that each weight is divided by the square of the matching height?
    - These are called **element-wise** operations

In [2]:
weight / height**2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

---
- **`[PROBLEM]`**: Python `lists` do not support **vector** operations
- We wanted each element of one list to be combined with the element at the **same index** of the other (an element-wise operation)
- But `lists` do not know how to do that
- **`[SOLUTION]`**: Numpy!

In [4]:
import numpy as np

np_height = np.array(height)
np_weight = np.array(weight)
np_height

array([1.5, 1.6, 1.7])

---
- Trying the same calculation as before...

In [5]:
np_weight / np_height**2

array([26.66666667, 27.34375   , 27.6816609 ])
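To see what Numpy is saving us from, here is a minimal sketch that computes the same BMI values with plain `lists` and a loop-style expression, next to the Numpy version. The names `bmi_list` and `bmi_array` are only illustrative.

```python
import numpy as np

height = [1.50, 1.60, 1.70]
weight = [60, 70, 80]

# Pure-Python version: pair the elements up and compute one BMI at a time
bmi_list = [w / h**2 for w, h in zip(weight, height)]

# Numpy version: the same arithmetic, applied to whole arrays at once
bmi_array = np.array(weight) / np.array(height) ** 2

print(bmi_list)
print(bmi_array)
```

Both give the same numbers; the Numpy version simply lets us write the formula as it appears on paper.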



## 1.4. First Numpy tool - the `array`
---

- As we saw in the previous example, a `np.array` is similar to a Python `list`
- Like an `int` or a `float`, a `np.array` is just another **type** - a more complex one, built by other people


---
- `np.array` is **indexed** just like a Python `list`
- So, we can use `[ ]` to access an **element**

In [12]:
weight[0], np_weight[0]

(60, 60)

---
- And we can also **slice** the `np.array`

In [42]:
weight[1:], np_weight[1:]

([70, 80], array([70, 80]))



## 1.5. Numpy **types** interacting with Python **types**
---

- **Numpy** numbers are different from Python numbers, because they have different **types**!

In [14]:
type(np_weight[0])

numpy.int64

---
- We can **combine** **Numpy** numbers and **Python** numbers in the same operation

In [16]:
result = np_weight[0] * 2
print(result)
type(result)

120


numpy.int64

- **`[NOTE]`**: **Type conversion** occurs! When a **Numpy** number is combined with a Python number, the result is a **Numpy** number
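If a plain Python number is needed again (for example, to hand it to code that does not know about Numpy), a short sketch of two ways to convert back is shown below. The variable names are illustrative.

```python
import numpy as np

np_weight = np.array([60, 70, 80])
result = np_weight[0] * 2            # a Numpy integer (e.g. numpy.int64)

as_python_int = result.item()        # .item() returns a plain Python number
also_python_int = int(result)        # the built-in int() works as well

print(type(result))
print(type(as_python_int))
```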



## 1.6. Comparing `np.array` with `list`
---

- The same operator can behave differently on a `np.array` and on a `list`

In [9]:
# PASTES THE LISTS TOGETHER
weight + height

[60, 70, 80, 1.5, 1.6, 1.7]

In [10]:
# ELEMENT-WISE SUM
np_weight + np_height

array([61.5, 71.6, 81.7])

---
- `np.array` can only contain one **type**
- `lists` can contain **different types**

In [8]:
np.array([1, '1', True])

array(['1', '1', 'True'], dtype='<U21')

- Everything got **converted** to a `string`! Be careful not to mix **types** with `np.array`
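The `dtype` attribute tells us which single type an array ended up with. The sketch below, with made-up values, shows how mixing types changes it.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])      # int + float -> everything becomes float
with_text = np.array([1, 2.5, 'three'])    # any string -> everything becomes text

print(ints.dtype)
print(mixed_numbers.dtype)
print(with_text.dtype)
```

Checking `dtype` right after building an array is a cheap way to catch an accidental conversion.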


## 1.7. Exercise - Celsius to Fahrenheit, `np.array` style
---

Let's take advantage of `np.array` **element-wise** operations to calculate an **array** of temperatures.

$$F = \frac{9}{5} * C + 32$$

1. Create a `np.array` with the temperatures `[30, 50, 70]`
2. Convert the temperatures to Fahrenheit


**`[HINT]`**: `np.array([1,2]) * 2 = np.array([2, 4])`


---

In [23]:
temperatures = np.array([30, 50, 70])
(9/5) * temperatures + 32

array([ 86., 122., 158.])


## 1.8. Exercise - Manual Linear Regression
---

Once more, the objective is to take advantage of `np.array` **element-wise** operations.

A Linear Regression Model is defined as follows:

$$\hat{y} = \sum_{i=0}^{N}{a_i*x_i}$$

Our prediction is the total sum of the **element-wise** product between $a$ and $x$.
- $a$ is the vector of **model coefficients**
- $x$ is the data

---
We are trying to predict the number of home runs that a league baseball player will hit on average this year.
We have the following statistics for this player:
- Base 1 Runs: 1
- Stray Ball Catch: 3
- Ball Punts: 5

So, we have the following data:
- The **data**, `x = [1, 3, 5]`
- The respective **coefficients**, `a = [0.2, 0.5, 0.75]`

---

1. Create `np.array` with both the **data** `x` and **coefficients** `a`
2. Do the **element-wise** product with `x` and `a`
3. Do the `sum()` of the resulting product

In [24]:
coefs = np.array([0.2, 0.5, 0.75])
# ONE VALUE PER STATISTIC: BASE 1 RUNS, STRAY BALL CATCH, BALL PUNTS
data = np.array([1, 3, 5])

---

In [25]:
result = sum(coefs * data)
result

5.45
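The "multiply element-wise, then sum" pattern is common enough that Numpy has a single function for it, `np.dot`. A minimal sketch comparing both routes (the names `manual` and `dot` are illustrative):

```python
import numpy as np

coefs = np.array([0.2, 0.5, 0.75])
data = np.array([1, 3, 5])

manual = (coefs * data).sum()   # element-wise product, then sum
dot = np.dot(coefs, data)       # the same computation in one call

print(manual, dot)
```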


## 1.9. `np.array` subsetting
---

- We can **subset** (take a few elements) from a `np.array`
- **Subsetting** can be done with the help of another `np.array`

---
- This **array** can be of **indexes**


In [374]:
sample = np.array([1, 2, 3, 4, 5])
sample[[0, 3, -1]]

array([1, 4, 5])

---
- Or an **array** of `booleans` (yes/no for **each** position)

In [34]:
sample[[True, False, True, False, True]]

array([1, 3, 5])

---
- If a `boolean` array (or `list`) is not of the same size...

In [33]:
sample[[True, False]]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 5 but corresponding boolean dimension is 2


## 1.10. Exercise - subsetting `np.array` with `booleans`
---

Get the first and last elements of a `range` of numbers from 1 to 5 using `np.array`.


1. Create a `np.array` with a `list` of numbers, that you get from a `range`
2. `print` the **first** and **last** elements of the `np.array`, using:
    - A `list` of numbered indexes
    - A `list` of `booleans`

---

In [73]:
array = np.array(range(1, 6))
print(array[[0, -1]])
print(array[[True, False, False, False, True]])

[1 5]
[1 5]



## 1.11. Further notes on `np.array` subsetting with `booleans`
---
- As you might recall, we can get a `boolean` as the result of a **condition**

In [375]:
print(1==1)
print(1!=1)

True
False


---
- With `np.array`, as we expect by now, a comparison gives us a `np.array` of the **same size**
- Because **operations** are **element-wise**!

In [37]:
print(np_weight)
np_weight == 60

[60 70 80]


array([ True, False, False])

---
- As we saw, we can **subset** an **array** with another **array of booleans** 
- We can get a `np.array` of `booleans` with a **condition**

In [44]:
people_with_60_kilos = np_weight == 60
np_height[people_with_60_kilos]

array([1.5])
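Conditions can also be combined before subsetting. The sketch below assumes the same `np_weight` and `np_height` values as above; `&` stands for "and", `|` stands for "or", and each condition needs its own parentheses.

```python
import numpy as np

np_height = np.array([1.5, 1.6, 1.7])
np_weight = np.array([60, 70, 80])

# Heavier than 60 kilos AND shorter than 1.7m
heavier_and_shorter = (np_weight > 60) & (np_height < 1.7)

# Exactly 60 kilos OR exactly 1.7m tall
lightest_or_tallest = (np_weight == 60) | (np_height == 1.7)

print(np_weight[heavier_and_shorter])
print(np_weight[lightest_or_tallest])
```

Python's own `and` / `or` keywords do not work element-wise on arrays, which is why the symbols are used here.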


## 1.12. Exercise - Get heights of people with less than 70 kilos

---

We want to **subset** our `np.array` of **heights** (`np_height`), keeping only the people who weigh **less** than 70 kilos.

1. Create a **condition** where `np_weight` is **less** than 70
2. **Subset** `np_height` with the resulting `boolean` `np.array`

**`[HINT]`**: Remember that `<` is one of the **comparison operators**

In [377]:
np_height, np_weight

(array([1.5, 1.6, 1.7]), array([60, 70, 80]))

---

In [75]:
people_with_less_than_70_kilos = np_weight < 70
np_height[people_with_less_than_70_kilos]

array([1.5])


## 1.13. 2-D `np.array` - Welcome to the Matrix
---

In [46]:
type(np_weight)

numpy.ndarray

- `numpy.ndarray` - stands for $N$-dimensional **array**
    - 1-D
    - 2-D
    - 3-D
    - ...
    - $N$-D (in practice, Numpy caps this at a few dozen dimensions)

---
- Let's build a 2D `np.array` from `np_weight` and `np_height`

In [379]:
np_data = np.array([np_weight, np_height])
np_data

array([[60. , 70. , 80. ],
       [ 1.5,  1.6,  1.7]])

---
- We can use `np.array.shape` to find out the **dimensions** of our `np.array`

In [380]:
# ROWS X COLUMNS
np_data.shape

(2, 3)

- **`[NOTE]`**: this is not a **method** like `append()`. Methods are called with parentheses `( )`. `shape` is an **attribute**, so we access it without parentheses
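`shape` is not the only attribute worth knowing. A short sketch of a few others, using the same values as `np_data`:

```python
import numpy as np

np_data = np.array([[60, 70, 80], [1.5, 1.6, 1.7]])

print(np_data.shape)   # (rows, columns)
print(np_data.ndim)    # number of dimensions
print(np_data.size)    # total number of elements
print(np_data.dtype)   # the single type shared by all elements
```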


## 1.14. Subsetting 2-D `np.array`
---

- We can also **index** the 2-D `np.array`, just like we would in a `list` of `lists`

```
        [0]   [1]   [2]
array([[60. , 70. , 80. ],    [0]
       [ 1.5,  1.6,  1.7]])   [1]
```

- **`[REMEMBER]`**: Indexes start at 0!

In [381]:
np_data[0][-1]

80.0

---
- We get the first list `[60. , 70. , 80. ]` with **index** 0
- Then, we get the last element `80.0` with **index** -1

---
- A `np.array` allows us to be more flexible with our indexing.
- We can write **sequential** indexes inside a single pair of brackets, separated by a **comma** `,`

In [382]:
np_data[0, -1]

80.0


## 1.14. Subsetting 2-D `np.array` (cont.)
---

- Suppose that we want to take both `height` and `weight` from the first person
- So, we want both first and second `lists`, but only with the first **elements**

---
- We can use `:` to get all the `lists`

In [384]:
np_data[:]

array([[60. , 70. , 80. ],
       [ 1.5,  1.6,  1.7]])

---
- We can now **index** both lists, at the same time, by using the `:` character and **sequential indexing**
- We use the same logic as if it were one list only

In [383]:
np_data[:, 0]

array([60. ,  1.5])

- **`[REMEMBER]`**: We can think of `:` as a `list` **slice** from beginning to end. Everything!

---
- **`[NOTE]`**: Watch what happens if we use separate square brackets `[ ]`

In [385]:
np_data[:][0]

array([60., 70., 80.])

- This is because `np_data[:]` returns both `lists` and we are just **indexing** the first
- So we have to use **sequential** indexing of Numpy (i.e. `np_data[:, 0]`)

---
- We can also **slice** along the second index using this notation

In [386]:
result = np_data[:, 1:]
print(result)
result.shape

[[70.  80. ]
 [ 1.6  1.7]]


(2, 2)

- We retain our **matrix** structure

---
- We can also do the inverse
- Select the first `list` and take **all** values using `:`

In [387]:
np_data[0, :]

array([60., 70., 80.])

- **`[MOTIVATION]`**: This **matrix** shape of our data is what is used when training models (Machine Learning)


## 1.15. Summary of Numpy Indexes
---

<br>
<center>
    <img src="https://www.oreilly.com/library/view/python-for-data/9781449323592/httpatomoreillycomsourceoreillyimages2172114.png">
    </center>




## 1.16. Exercise - subsetting 2-D `np.array`
---

We want to **subset** the following from `np_data`:
1. All values from the second `np.array`
2. The second value of **both** lists
3. The second value of **both** lists, BUT retaining the matrix structure
4. All values from **both** lists that respect the **condition** where people weigh **exactly** 70 kilos.

**`[HINT]`**: For steps 2, 3 and 4, **remember** that we can use `:` to get **both** `np.array`

**`[HINT]`**: For step 3, **remember** that we can **retain the matrix** structure by using **slicing**, even though the slice selects **just one column**!


In [388]:
np_data

array([[60. , 70. , 80. ],
       [ 1.5,  1.6,  1.7]])

---

In [391]:
# STEP 1
np_data[1], np_data[1, :]

(array([1.5, 1.6, 1.7]), array([1.5, 1.6, 1.7]))

In [392]:
# STEP 2
np_data[:, 1]

array([70. ,  1.6])

In [393]:
# STEP 3 (1:-1 SELECTS ONLY THE MIDDLE COLUMN BECAUSE THERE ARE EXACTLY 3 COLUMNS)
np_data[:, 1:-1]

array([[70. ],
       [ 1.6]])

In [395]:
# STEP 3
np_data[:, 1:2]

array([[70. ],
       [ 1.6]])

In [396]:
# STEP 4
condition = np_weight == 70
np_data[:, condition]

array([[70. ],
       [ 1.6]])
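The difference between steps 2 and 3 is easiest to see by comparing shapes. A minimal sketch with the same values as `np_data` (the names `column_flat` and `column_2d` are illustrative):

```python
import numpy as np

np_data = np.array([[60, 70, 80], [1.5, 1.6, 1.7]])

column_flat = np_data[:, 1]     # integer index: that dimension is dropped
column_2d = np_data[:, 1:2]     # slice: the result is still a matrix

print(column_flat.shape)
print(column_2d.shape)
```

An integer index removes a dimension, while a slice keeps it, even when the slice covers a single position.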


## 1.17. Numpy statistics
---
- If your data is small, you can just look at it (e.g. the data for one family)

- But what if we have **thousands of datapoints**? (e.g. a city-wide survey)

In [407]:
np_city = np.array(
    [
        20 + (np.random.rand(5000) * (120 - 20)),
        1.0 + (np.random.rand(5000) * (2.1 - 1.0)),
    ]
)

---
- We now have data for a **city-wide survey** of `weight` and `height`

In [408]:
np_city

array([[ 54.03310024,  86.30270423, 112.28793577, ...,  94.48193202,
         80.37810918, 108.46025459],
       [  1.10856127,   1.93866121,   1.65456619, ...,   1.36639437,
          1.50417781,   1.46008911]])

In [409]:
np_city.shape

(2, 5000)
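Because the survey is generated randomly, the numbers change on every run. If repeatable numbers are preferred, one option is a seeded generator, sketched below. The seed value `42` is an arbitrary choice, and this is an alternative to the `np.random.rand` cell above, not a reproduction of its exact values.

```python
import numpy as np

rng = np.random.default_rng(seed=42)    # arbitrary seed, chosen for illustration

np_city = np.array(
    [
        rng.uniform(20, 120, size=5000),     # weights in kilos
        rng.uniform(1.0, 2.1, size=5000),    # heights in metres
    ]
)

print(np_city.shape)
```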

---
- Statistics give us an **overall picture** of our data
    - What is the **mean** weight of the whole city?
    - What are the **minimum** and **maximum** weights?
    - For people between 1.70m and 2m, what is the **mean** weight?

---
- Numpy offers several statistical tools. One of them is the **mean**

- These tools are functions of the Numpy package, so just as we wrote `np.array`, we write `np.mean`

In [410]:
# AVERAGE HEIGHT
np.mean(np_city[1, :])

1.5504235091174978

---
- But, our `np.array` also has **methods**

- And it includes `mean`

In [411]:
np_city[1, :].mean()

1.5504235091174978


## 1.18. Mean, Standard Deviation, Median, Minimum, Maximum
---

- **Mean** with `np.mean()` or `np.array.mean()`

In [412]:
print(np.mean(np_city[0, :]))
print(np_city[0, :].mean())

69.86521517246949
69.86521517246949


---
- **Standard Deviation** with `np.std()` or `np.array.std()`

In [413]:
print(np.std(np_city[1, :]))
print(np_city[1, :].std())

0.3190699058400676
0.3190699058400676


---
- **Median** is only accessible through the Numpy package, with `np.median()`

In [414]:
np.median(np_city[1, :])

1.5504371594230955

In [415]:
np_city[1, :].median()

AttributeError: 'numpy.ndarray' object has no attribute 'median'

---
- **Minimum** with `np.min()`, `np.array.min()` or Python `min()`

In [416]:
print(np.min(np_city[1, :]))
print(np_city[1, :].min())
print(min(np_city[1, :]))

1.000044577929344
1.000044577929344
1.000044577929344


---
- **Maximum** with `np.max()`, `np.array.max()` or Python `max()`

In [417]:
print(np.max(np_city[1, :]))
print(np_city[1, :].max())
print(max(np_city[1, :]))

2.098604466093362
2.098604466093362
2.098604466093362
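So far each statistic was computed on one row at a time. These functions also accept an `axis` argument that computes the statistic for every row (or column) in one call. A small sketch on the three-person data, with illustrative variable names:

```python
import numpy as np

np_data = np.array([[60, 70, 80], [1.5, 1.6, 1.7]])

row_means = np_data.mean(axis=1)   # one mean per row: weight, then height
row_max = np_data.max(axis=1)      # one maximum per row

print(row_means)
print(row_max)
```

With `axis=1` the calculation runs along the columns of each row, so we get one result per row.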



## 1.19. Exercise - Calculating statistics
---

Calculate the following:
1. The **mean** `height` for the whole city
2. The **minimum** and **maximum** `weight`, for the whole city
3. The **median** `weight` for people taller than 1.9m 

In [422]:
# STEP 1
print(np_city[1].mean())
print(np.mean(np_city[1]))

1.5504235091174978
1.5504235091174978


In [424]:
# STEP 2
print(np_city[0].min())
print(np.min(np_city[0]))
print(min(np_city[0]))

20.00305132327707
20.00305132327707
20.00305132327707


In [425]:
# STEP 2
print(np_city[0].max())
print(np.max(np_city[0]))
print(max(np_city[0]))

119.99872656352959
119.99872656352959
119.99872656352959


In [428]:
# STEP 3
condition = np_city[1] > 1.9
subdata = np_city[0, condition]
print(np.median(subdata))

68.4113876709607



## 1.20. Other Functions
---



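- **Sum** with `np.sum()`, `np.array.sum()` or Python `sum()`
- Python `sum()` and Numpy add the numbers up in a different order, so the last decimal places can differ slightly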

In [113]:
print(sum(np_city[0, :]))
print(np.sum(np_city[0, :]))
print(np_city[0, :].sum())

7754.835365308759
7754.835365308741
7754.835365308741


---
- **Sort** with `np.sort()`, `np.array.sort()` or Python `sorted()`

In [121]:
print(sorted(np_city[0, :])[:10])
print(np.sort(np_city[0, :]))
# IN PLACE - CHANGES np_city ITSELF, SO ROW 0 NO LONGER LINES UP WITH ROW 1
np_city[0, :].sort()
print(np_city[0, :])

[1.0000145004799257, 1.0001051477574863, 1.000143329489774, 1.0006611150888316, 1.000997089398832, 1.0013739505753136, 1.0015204943658784, 1.0017365801472062, 1.0018867052026383, 1.0021427820136912]
[1.0000145  1.00010515 1.00014333 ... 2.09896207 2.09900546 2.09902032]
[1.0000145  1.00010515 1.00014333 ... 2.09896207 2.09900546 2.09902032]
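The difference between the copying and the in-place sort is easier to follow on a tiny array. A minimal sketch with made-up values:

```python
import numpy as np

values = np.array([3, 1, 2])

sorted_copy = np.sort(values)    # returns a new, sorted array
unchanged = values.tolist()      # the original is still in its old order

returned = values.sort()         # sorts in place and returns None

print(sorted_copy)
print(unchanged)
print(values)
```

Since the in-place version returns `None`, writing `values = values.sort()` would lose the data.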



## 1.21. Summary
---



# 2. New Python `type` - Dictionaries

<br>
<center>
    <img src="https://files.realpython.com/media/How-to-Iterate-Through-A-Dictionary-in-Python_Watermarked.06d6547f531b.jpg">
</center>

## 2.1. Introduction
---

- Python **dictionaries** or `dict` are like a real-life dictionary

<center>
            <img src='img/dictionary.png' width=10%>
    </center>

---
- We **look up** a **key** word in a dictionary to find its corresponding **definition** or **value**
- In Python, it is the same thing: we have **keys** and **values**
- Each **key** corresponds to **one value**

## 2.2. Motivation
---
- Suppose we have a `list` of phone contacts (done in Class 1)

In [429]:
name = "50cent"
phone_number = 50

friend_name = "Xzibit"
friend_phone_number = 100200300

contact1 = [name, phone_number]
contact2 = [friend_name, friend_phone_number]
contacts = [contact1, contact2]
print(contacts)

[['50cent', 50], ['Xzibit', 100200300]]


---
- **`[PROBLEM]`**: If we want to get Xzibit's phone number, we would have to **iterate** the `list` until we find Xzibit
- What if we could look up Xzibit directly by the name `Xzibit`?
- **`[SOLUTION]`**: Python dictionaries!

## 2.3. Making Python `dict`
---
- We can make `dict` using **curly brackets** `{ }`
- We use `:` to **assign** a **value** to a **key**

In [430]:
contacts = {
    '50cent': 50,
    'Xzibit': 100200300
}
print(contacts)
print(contacts['Xzibit'])

{'50cent': 50, 'Xzibit': 100200300}
100200300


- `'50cent'` and `'Xzibit'` are **keys**
- Both numbers are **values**
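A `dict` can also be changed after it is created. The sketch below shows a few everyday operations; the extra contact and its number are made up for illustration.

```python
contacts = {
    '50cent': 50,
    'Xzibit': 100200300
}

contacts['Eminem'] = 313313313    # add a new key (hypothetical entry)
contacts['50cent'] = 5050         # overwrite the value of an existing key

print('Xzibit' in contacts)             # membership test looks at the keys
print(contacts.get('Dr. Dre'))          # .get() returns None for a missing key
print(contacts.get('Dr. Dre', 'n/a'))   # ... or a default of our choosing
```

Using square brackets with a missing key raises a `KeyError`, so `.get()` is handy when we are not sure the key exists.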

## 2.4. Exercise - Dictionary of Numpy data
---

1. Build a `dict` with the following data:
    - **Key** `'weight'` and **value** `np_weight`
    - **Key** `'height'` and **value** `np_height`
2. Finally, **sum** both `np_weight` and `np_height`, but **using the dictionary to access the values** (i.e. `dict['x'] + dict['y']`)

---

In [431]:
data = {
    'weight': np_weight,
    'height': np_height
}
data['weight'] + data['height']

array([61.5, 71.6, 81.7])

## 2.5. Dictionaries of dictionaries
---

- We can build complex **data structures** with `dict`
- We can have `lists` of `lists`, but we can also have `dict` of `dict` of `dict` of ...


In [432]:
contacts = {
    '50cent': {
        'number': 50,
        'address': {
            'city': 'USA',
            'zip_code': '50-50'
        }
    },
    'Xzibit': {
        'number': 100200300,
        'address': {
            'city': 'USA',
            'zip_code': '001-230'
        }
    },
}

---
- We can chain **sequential indexes**, given that we know the structure of our `dict`

In [433]:
print(contacts['Xzibit']['address']['zip_code'])

001-230


## 2.6. Iterating dictionaries
---
- **`[REMEMBER]`**: This is how we iterate over a sequence (such as a `list` or a `range`) with a `for` loop

In [434]:
numbers = range(5)
for element in numbers:
    print(element)

0
1
2
3
4


---
- Iterating over a `dict` works the same way
- The element we get, though, is the `dict` **key**

In [435]:
for key in contacts:
    print(key)
    print(contacts[key])

50cent
{'number': 50, 'address': {'city': 'USA', 'zip_code': '50-50'}}
Xzibit
{'number': 100200300, 'address': {'city': 'USA', 'zip_code': '001-230'}}


---
- There is a way to iterate both **keys** and **values** at the same time, using the `dict.items()` method
- Since we are getting both a **key** and a **value** at the same time, we have to use **two iteration variables**.
- We'll call our **iteration variables** `key` and `value`

In [436]:
for key, value in contacts.items():
    print(key)
    print(value)

50cent
{'number': 50, 'address': {'city': 'USA', 'zip_code': '50-50'}}
Xzibit
{'number': 100200300, 'address': {'city': 'USA', 'zip_code': '001-230'}}
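Since each value here is itself a `dict`, we can keep indexing inside the loop. A small sketch that collects every contact's zip code into a new `dict` (the name `zip_codes` is illustrative):

```python
contacts = {
    '50cent': {'number': 50, 'address': {'city': 'USA', 'zip_code': '50-50'}},
    'Xzibit': {'number': 100200300, 'address': {'city': 'USA', 'zip_code': '001-230'}},
}

zip_codes = {}
for name, details in contacts.items():
    # details is the inner dict, so sequential indexing still works
    zip_codes[name] = details['address']['zip_code']

print(zip_codes)
```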


## 2.7. Summary
---


# 3. Functions with **named** arguments

<br>
<center>
    <img src="https://files.realpython.com/media/Pythons-None-Type-Null-in-Python_Watermarked.9d48d487f417.jpg">
</center>

## 3.1. Guided Exercise - Min Max Scaling
---

- In Data Science, we sometimes need to **scale** data
- To **scale** is to map the data onto a new **range** (e.g. `[0, 100] -> [0, 1]`)
- This is helpful when developing predictive models

---
**Min Max Scaling** between 0 and 1:
$$x' = \frac{x - min(x)}{max(x) - min(x)}$$

**Min Max Scaling** between $a$ and $b$:
$$x' = a + \frac{(x - min(x))(b-a)}{max(x) - min(x)}$$

---
Pieces of the Puzzle:
- **`[ARGUMENT]`**: The data (we'll experiment with `np_weight`)
- **`[CALCULATION]`**: The `min` and `max` values
- **`[ARGUMENT]`**: Ranges `a` and `b`


In [348]:
# DATA
np_weight

array([60, 70, 80])

---
- Calculating `min` and `max` for `np_weight`

In [350]:
minimum = np.min(np_weight)
maximum = np.max(np_weight)
minimum, maximum

(60, 80)

---
- Scaling between 0 and 1
    - `a=0`
    - `b=1`
- Applying the first formula...

In [351]:
minimum = np.min(np_weight)
maximum = np.max(np_weight)

numerator = np_weight - minimum
denominator = maximum - minimum
numerator / denominator

array([0. , 0.5, 1. ])

---
- Applying the second formula for:
    - `a=10`
    - `b=100`

In [353]:
minimum = np.min(np_weight)
maximum = np.max(np_weight)

a = 10
b = 100
numerator = (np_weight - minimum)*(b-a)
denominator = maximum - minimum
a + (numerator / denominator)

array([ 10.,  55., 100.])

---
- Time to put it all into a function
- What are our **arguments**? What do we want to be able to change in the function's behavior?
    - `data` - **any data can be scaled**, not just `np_weight`
    - `a` - we can select any desired **lower bound**
    - `b` - we can select any desired **upper bound**

In [354]:
def min_max_scaling(data, a, b):
    minimum = np.min(data)
    maximum = np.max(data)
    
    numerator = (data - minimum) * (b - a)
    denominator = maximum - minimum
    return a + (numerator / denominator)

In [355]:
min_max_scaling(np_weight, 0, 1)

array([0. , 0.5, 1. ])

In [356]:
min_max_scaling(np_weight, 10, 567)

array([ 10. , 288.5, 567. ])
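One case the formula does not cover is data where every value is the same: `max(x) - min(x)` is then zero and the division breaks down. Below is a hedged sketch of a variant that guards against this. The function name `min_max_scaling_safe` and the decision to return the lower bound `a` are assumptions made for illustration, not part of the exercise.

```python
import numpy as np

def min_max_scaling_safe(data, a, b):
    """Hypothetical variant of min_max_scaling that tolerates constant data."""
    data = np.asarray(data, dtype=float)
    minimum = np.min(data)
    maximum = np.max(data)

    denominator = maximum - minimum
    if denominator == 0:
        # All values are identical: map everything to the lower bound
        # (an arbitrary choice made for this sketch)
        return np.full(data.shape, float(a))

    numerator = (data - minimum) * (b - a)
    return a + (numerator / denominator)

print(min_max_scaling_safe([60, 70, 80], 0, 1))
print(min_max_scaling_safe([5, 5, 5], 0, 1))
```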

## 3.2. Adding default behavior to functions
---
- What if we want to add **default** behavior to our **scaling function**?
- We want it **by default** to scale between 0 and 1

---
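- We can make use of **named arguments** in the **function definition**
- **Named arguments** have **default values**, which are used whenever the call leaves them out (elsewhere you will often see them called **keyword arguments**)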

In [153]:
def min_max_scaling(data, a=0, b=1):
    minimum = np.min(data)
    maximum = np.max(data)
    
    numerator = (data - minimum) * (b - a)
    denominator = maximum - minimum
    return a + (numerator / denominator)

---
- Now we can call the **function** without the need to specify the `a` or `b`

In [154]:
min_max_scaling(np_weight)

array([0. , 0.5, 1. ])

---
- When calling the **function**, we can **write** the names of the arguments. Like so:

In [358]:
min_max_scaling(np_weight, a=30, b=60)

array([30., 45., 60.])

---
- **`[NOTE]`**: You can still call the function without writing the argument names


In [357]:
min_max_scaling(np_weight, 30, 60)

array([30., 45., 60.])

---
- **`[NOTE]`**: You can alter the order of the **named arguments**, as long as you write their name!

In [361]:
min_max_scaling(np_weight, b=60, a=30)

array([30., 45., 60.])

---
- **`[NOTE]`**: But you can't put a **named argument** before a **non-named argument**

In [157]:
min_max_scaling(b=60, np_weight, a=30)

SyntaxError: positional argument follows keyword argument (<ipython-input-157-7e30830ec238>, line 1)

- **`[MOTIVATION]`**: Imagine that a **function** had 100 arguments. You would have to know the **exact order** of each argument. What a headache! We can make our lives easier by using **named arguments** 
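Named arguments also let us override just one default and leave the others alone. A minimal sketch, repeating the function definition so the cell runs on its own:

```python
import numpy as np

def min_max_scaling(data, a=0, b=1):
    minimum = np.min(data)
    maximum = np.max(data)

    numerator = (data - minimum) * (b - a)
    denominator = maximum - minimum
    return a + (numerator / denominator)

np_weight = np.array([60, 70, 80])

only_upper = min_max_scaling(np_weight, b=10)    # a keeps its default of 0
only_lower = min_max_scaling(np_weight, a=-1)    # b keeps its default of 1

print(only_upper)
print(only_lower)
```

Without the name, `min_max_scaling(np_weight, 10)` would set `a`, not `b`, because positional values are filled in from left to right.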

## 3.3. Summary
---

# 4. Introduction to Pandas

<br>
<center>
        <img src="https://files.realpython.com/media/Intro-to-Exploratory-Data-Analysis-With-Pandas_Watermarked.81a7d7df468f.jpg">
</center>

## 4.1. Motivating Pandas
---

- Data Science works with hundreds of **data points**
- The **structure** of the data can vary greatly
    - Often it's tabular data - in the form of a **table**

---
- We have already seen the 2-D `np.array`, but:
    - A `np.array` can only have one **type**
    - Nothing tells us which row holds which variable
    - Unless we also keep the names somewhere (e.g. in a `dict`), we can get lost very easily

In [437]:
np_data

array([[60. , 70. , 80. ],
       [ 1.5,  1.6,  1.7]])

## 4.2. Installing and Importing Pandas
---
- Run the following command in your notebook

In [345]:
!pip install pandas



In [160]:
import pandas as pd

In [165]:
df = pd.read_csv('../housing_data.csv').drop('Unnamed: 0', axis=1)

## 4.3. Pandas DataFrames
---
- A `pd.DataFrame` holds your data in a **tabular** format: a table of rows (measurements) and columns (variables/features)
    - Every **row** is a measurement (an observation)
    - Every **column** is a variable recorded for each observation

In [166]:
df

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
0,856,854,0,,3,1Fam,TA,No,706.0,0.0,...,WD,0,Pave,8,856.0,AllPub,0,2003,2003,2008
1,1262,0,0,,3,1Fam,TA,Gd,978.0,0.0,...,WD,0,Pave,6,1262.0,AllPub,298,1976,1976,2007
2,920,866,0,,3,1Fam,TA,Mn,486.0,0.0,...,WD,0,Pave,6,920.0,AllPub,0,2001,2002,2008
3,961,756,0,,3,1Fam,Gd,No,216.0,0.0,...,WD,0,Pave,7,756.0,AllPub,0,1915,1970,2006
4,1145,1053,0,,4,1Fam,TA,Av,655.0,0.0,...,WD,0,Pave,9,1145.0,AllPub,192,2000,2000,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2914,546,546,0,,3,Twnhs,TA,No,0.0,0.0,...,WD,0,Pave,5,546.0,AllPub,0,1970,1970,2006
2915,546,546,0,,3,TwnhsE,TA,No,252.0,0.0,...,WD,0,Pave,6,546.0,AllPub,0,1970,1970,2006
2916,1224,0,0,,4,1Fam,TA,No,1224.0,0.0,...,WD,0,Pave,7,1224.0,AllPub,474,1960,1996,2006
2917,970,0,0,,3,1Fam,TA,Av,337.0,0.0,...,WD,0,Pave,6,912.0,AllPub,80,1992,1992,2006


- `pd.DataFrame` can have different **data types**
- `np.array` can't have different **data types**
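The `dtypes` attribute shows the type of each column. The sketch below uses a tiny made-up table (the names in the `name` column are hypothetical) to show text, integers and decimals living side by side.

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Ana', 'Bruno', 'Carla'],    # hypothetical names
    'weight': [60, 70, 80],
    'height': [1.5, 1.6, 1.7],
})

print(people.dtypes)    # one type per column
```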

## 4.4. DataFrame Structure Review

<center>
    <img src="https://doit-test.readthedocs.io/en/latest/_images/base_01_pandas_5_0.png">
    </center>

## 4.5. Building `pd.DataFrame`
---

- We can easily build a `pd.DataFrame` using a `dict`
- Remember `weight` and `height` data?

In [171]:
data

{'weight': array([60, 70, 80]), 'height': array([1.5, 1.6, 1.7])}

---
- We can create a `pd.DataFrame` from a `dict` by using the Pandas **package function** `pd.DataFrame`

In [170]:
df = pd.DataFrame(data=data)
df

|   | weight | height |
|---|--------|--------|
| 0 | 60 | 1.5 |
| 1 | 70 | 1.6 |
| 2 | 80 | 1.7 |


---
- **`[NOTE]`**: we are using a **named argument**, `data`
- **`[NOTE]`**: to view the contents of a `pd.DataFrame`, just type its **variable name** at the end of the cell

---
- `print`ing a `pd.DataFrame` would result in the following output.

In [172]:
print(df)

   weight  height
0      60     1.5
1      70     1.6
2      80     1.7


## 4.6. Exercise - Loading data to a DataFrame
---

- We generally do not build DataFrames from dictionaries
- Because data is rarely shared as dictionaries
- Data is shared in files of various formats (e.g. `.csv` - comma separated values)

We can load `.csv` data by using pandas `read_csv` function.

```
dataframe = pd.read_csv('housing_data.csv')
```

Try loading the **housing data** file named `housing_data.csv`, and view the contents of the `pd.DataFrame`.

In [174]:
df = pd.read_csv('../housing_data.csv')
df

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
0,0,856,854,0,,3,1Fam,TA,No,706.0,...,WD,0,Pave,8,856.0,AllPub,0,2003,2003,2008
1,1,1262,0,0,,3,1Fam,TA,Gd,978.0,...,WD,0,Pave,6,1262.0,AllPub,298,1976,1976,2007
2,2,920,866,0,,3,1Fam,TA,Mn,486.0,...,WD,0,Pave,6,920.0,AllPub,0,2001,2002,2008
3,3,961,756,0,,3,1Fam,Gd,No,216.0,...,WD,0,Pave,7,756.0,AllPub,0,1915,1970,2006
4,4,1145,1053,0,,4,1Fam,TA,Av,655.0,...,WD,0,Pave,9,1145.0,AllPub,192,2000,2000,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2914,1454,546,546,0,,3,Twnhs,TA,No,0.0,...,WD,0,Pave,5,546.0,AllPub,0,1970,1970,2006
2915,1455,546,546,0,,3,TwnhsE,TA,No,252.0,...,WD,0,Pave,6,546.0,AllPub,0,1970,1970,2006
2916,1456,1224,0,0,,4,1Fam,TA,No,1224.0,...,WD,0,Pave,7,1224.0,AllPub,474,1960,1996,2006
2917,1457,970,0,0,,3,1Fam,TA,Av,337.0,...,WD,0,Pave,6,912.0,AllPub,80,1992,1992,2006


- **`[THOUGHT]`**: it is curious to think that a `DataFrame` behaves a lot like a `dict`
    - It has **keys**, the **columns** or **variables**
    - Each **variable** has values
    - What we see on the screen is, loosely speaking, a _pretty_ way of showing a `dict`
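The `dict`-like behavior can be checked directly. A short sketch using the small weight/height table, so that it runs without the housing file:

```python
import numpy as np
import pandas as pd

data = {
    'weight': np.array([60, 70, 80]),
    'height': np.array([1.5, 1.6, 1.7])
}
df = pd.DataFrame(data=data)

print(list(df.keys()))     # column names, like dict keys
print('weight' in df)      # membership test checks the column names
print(df['weight'])        # a column is looked up by its key

for column_name in df:     # iterating yields the column names, as with a dict
    print(column_name)
```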

## 4.7. Head or Tail, and Columns - DataFrame Overview
---
- We can look at a few **datapoints** from a `pd.DataFrame` by using **methods**:
    - `pd.DataFrame.head()` to show the first 5 **rows**
    - `pd.DataFrame.tail()` to show the last 5 **rows**

In [175]:
df.head()

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
0,0,856,854,0,,3,1Fam,TA,No,706.0,...,WD,0,Pave,8,856.0,AllPub,0,2003,2003,2008
1,1,1262,0,0,,3,1Fam,TA,Gd,978.0,...,WD,0,Pave,6,1262.0,AllPub,298,1976,1976,2007
2,2,920,866,0,,3,1Fam,TA,Mn,486.0,...,WD,0,Pave,6,920.0,AllPub,0,2001,2002,2008
3,3,961,756,0,,3,1Fam,Gd,No,216.0,...,WD,0,Pave,7,756.0,AllPub,0,1915,1970,2006
4,4,1145,1053,0,,4,1Fam,TA,Av,655.0,...,WD,0,Pave,9,1145.0,AllPub,192,2000,2000,2008


---
- But we can specify **how many rows** we want to see

In [178]:
df.tail(10)

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
2909,1449,630,0,0,,1,Twnhs,TA,Av,522.0,...,WD,0,Pave,3,630.0,AllPub,0,1970,1970,2006
2910,1450,546,546,0,,3,TwnhsE,TA,No,252.0,...,WD,0,Pave,5,546.0,AllPub,0,1972,1972,2006
2911,1451,1360,0,0,,3,1Fam,TA,Av,119.0,...,WD,0,Pave,8,1104.0,AllPub,160,1969,1979,2006
2912,1452,546,546,0,,3,Twnhs,TA,No,408.0,...,WD,0,Pave,5,546.0,AllPub,0,1970,1970,2006
2913,1453,546,546,0,,3,Twnhs,TA,No,0.0,...,WD,0,Pave,5,546.0,AllPub,0,1970,1970,2006
2914,1454,546,546,0,,3,Twnhs,TA,No,0.0,...,WD,0,Pave,5,546.0,AllPub,0,1970,1970,2006
2915,1455,546,546,0,,3,TwnhsE,TA,No,252.0,...,WD,0,Pave,6,546.0,AllPub,0,1970,1970,2006
2916,1456,1224,0,0,,4,1Fam,TA,No,1224.0,...,WD,0,Pave,7,1224.0,AllPub,474,1960,1996,2006
2917,1457,970,0,0,,3,1Fam,TA,Av,337.0,...,WD,0,Pave,6,912.0,AllPub,80,1992,1992,2006
2918,1458,996,1004,0,,3,1Fam,TA,Av,758.0,...,WD,0,Pave,9,996.0,AllPub,190,1993,1994,2006


---
- We can't see the names of all the **features/columns**
- We can do `pd.DataFrame.columns` to get the names of all the features that exist in the `pd.DataFrame` (82!), returned as a list-like `Index`

In [179]:
df.columns

Index(['Unnamed: 0', '1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley',
       'BedroomAbvGr', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath',
       'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1',
       'Condition2', 'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual',
       'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces',
       'Foundation', 'FullBath', 'Functional', 'GarageArea', 'GarageCars',
       'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt',
       'GrLivArea', 'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id',
       'KitchenAbvGr', 'KitchenQual', 'LandContour', 'LandSlope', 'LotArea',
       'LotConfig', 'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass',
       'MSZoning', 'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal',
       'MoSold', 'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
       'PavedDrive', '

- We can look at some of the names, but some of them are ambiguous or don't tell us a lot (e.g. `MiscFeature`)
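
When the default display cuts the column names short, converting the `Index` to a plain `list` shows every name. A minimal sketch, assuming a small made-up `toy_df` stands in for the housing data:

```python
import pandas as pd

# Hypothetical stand-in for the housing DataFrame (names and values are illustrative)
toy_df = pd.DataFrame({
    "Id": [1, 2],
    "MiscFeature": [None, "Shed"],
    "SalePrice": [208500, 181500],
})

# .columns is a pandas Index; .tolist() turns it into a regular Python list
column_names = toy_df.columns.tolist()
print(column_names)
print(len(column_names))
```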


# 4. Introduction to Pandas
## 4.8. Guided Exercise - Feature descriptions
---

Luckily, our client gave us a file with the **descriptions** of the features, and what values they might take.
Here's a sample from your `data_description.txt` file.

```
MSSubClass: Identifies the type of dwelling involved in the sale.	
        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density
       
...
```

---
Puzzle pieces:
1. Reading/getting the contents of a `.txt` file
2. Separating text (`str.split()`) by the features
3. `for` loop to **iterate** the different feature texts of step 2.
4. Conditions (`if` and `in`) to **check** if current iteration is the desired item
5. Putting it all in a function

---
### Step 1
- Reading files in Python

In [245]:
PATH = '../data_description.txt'
with open(PATH, 'r') as fp:
    file_content = fp.read()

In [246]:
type(file_content)

str
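
The same reading pattern can be tried without the client's file. This sketch writes a tiny made-up description file to a temporary folder first; the file name `demo_description.txt` and its contents are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

# Create a small hypothetical file so the example is self-contained
tmp_dir = Path(tempfile.mkdtemp())
demo_path = tmp_dir / "demo_description.txt"
demo_path.write_text("Street: Type of road access\n       Grvl\tGravel\n", encoding="utf-8")

# `with` closes the file automatically when the block ends
with open(demo_path, "r", encoding="utf-8") as fp:
    demo_content = fp.read()

# read() returns the whole file as a single string
print(type(demo_content))
print(len(demo_content))
```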

---
- Let's check the content of the file

In [247]:
print(file_content[:100])

MSSubClass: Identifies the type of dwelling involved in the sale.	
        20	1-STORY 1946 & NEWER A


In [248]:
file_content[:100]

'MSSubClass: Identifies the type of dwelling involved in the sale.\t\n        20\t1-STORY 1946 & NEWER A'

---
### Step 2
- We can see that a **new line** is represented by a `'\n'` character.
- There is also a `'\t'` for a **tab character**

In [249]:
print('====\n====\t====\n\n=====')

====
====	====

=====


---
- We know that different features are separated by 2 **newlines** (`'\n\n'`)
- **`[REMEMBER]`**: `str.split()` separates a **string** at every occurrence of the given **separator**
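
A quick sketch of that idea on a short made-up string (`demo_text` is not the real file content):

```python
# Two hypothetical feature blocks separated by a blank line ('\n\n')
demo_text = "Street: road access\n  Grvl\tGravel\n\nAlley: alley access\n  Pave\tPaved"

# Splitting on two consecutive newlines yields one chunk per feature
chunks = demo_text.split("\n\n")
print(len(chunks))
print(chunks[1])
```

The separator can be longer than one character, which is why `'\n\n'` works here.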

In [365]:
feature_texts = file_content.split('\n\n')
print(feature_texts[0])
len(feature_texts)

MSSubClass: Identifies the type of dwelling involved in the sale.	
        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES


79

In [362]:
# TOTAL FEATURES (79) + Id + SalePrice + Unnamed: 0
len(df.columns)

82

---
### Step 3
- If we now `str.split()` each **feature text** by a space `' '` and take the first word, we get the **feature name** (followed by a `:`)

In [366]:
# TESTING FOR FIRST 10
for feature_text in feature_texts[:10]:
    print(feature_text.split(' ')[0])

MSSubClass:
MSZoning:
LotFrontage:
LotArea:
Street:
Alley:
LotShape:
LandContour:
Utilities:
LotConfig:
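
The first word still carries the trailing `:`. If a clean name is needed, one option is `str.rstrip(':')`. A minimal sketch on two made-up feature texts:

```python
# Hypothetical, shortened feature texts
demo_chunks = [
    "MSSubClass: Identifies the type of dwelling",
    "Alley: Type of alley access",
]

feature_names = []
for chunk in demo_chunks:
    first_word = chunk.split(" ")[0]        # e.g. 'Alley:'
    feature_names.append(first_word.rstrip(":"))  # e.g. 'Alley'

print(feature_names)
```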


---
### Step 4
- Using an `if` to **catch** the respective feature text

In [370]:
for text in feature_texts:
    first_word = text.split(' ')[0]
    if 'Alley' in first_word:
        print('Found search result Alley')
        print(text)

Found search result Alley
Alley: Type of alley access to property
       Grvl	Gravel
       Pave	Paved
       NA 	No alley access


---
### Step 5
- Putting it all together

In [371]:
search_feature = 'Alley'

with open(PATH, 'r') as fp:
    file_content = fp.read()
feature_texts = file_content.split('\n\n')
for text in feature_texts:
    first_word = text.split(' ')[0]
    if search_feature in first_word:
        print(text)

Alley: Type of alley access to property
       Grvl	Gravel
       Pave	Paved
       NA 	No alley access


---
- Putting it in a **function**
- We only need to change `search_feature`

In [373]:
def feature_description(search_feature):
    """Print the description of the first feature whose name contains `search_feature`."""
    with open(PATH, 'r') as fp:
        file_content = fp.read()
    feature_texts = file_content.split('\n\n')
    for text in feature_texts:
        first_word = text.split(' ')[0]
        if search_feature in first_word:
            print(text)
            return

feature_description("Alley")

Alley: Type of alley access to property
       Grvl	Gravel
       Pave	Paved
       NA 	No alley access
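
Because the check uses `in`, a partial name also matches (for instance, `'Heat'` is contained in `'Heating:'`). If an exact match is preferred, a variant could look like the sketch below. The name `find_feature_text` and the sample text are hypothetical, and this version takes the file content as an argument and returns the text instead of printing it.

```python
def find_feature_text(file_content, search_feature):
    """Return the block whose feature name equals search_feature, or None."""
    for text in file_content.split("\n\n"):
        # split() without arguments splits on any whitespace
        words = text.split()
        if words and words[0].rstrip(":") == search_feature:
            return text
    return None


# Hypothetical two-feature description text
demo_content = (
    "Heating: Type of heating\n       GasA\tGas forced warm air furnace\n\n"
    "HeatingQC: Heating quality and condition\n       Ex\tExcellent"
)

print(find_feature_text(demo_content, "HeatingQC"))
print(find_feature_text(demo_content, "Heat"))  # no exact match
```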


# 4. Introduction to Pandas
## 4.9. Column Access
---

- Selecting data from a `DataFrame` is easy! 
- As with a `dict`, we use square brackets `[ ]` to access a variable
- The key is just the **variable name** (the column name)



In [260]:
df['MSZoning']

0       RL
1       RL
2       RL
3       RL
4       RL
        ..
2914    RM
2915    RM
2916    RL
2917    RL
2918    RL
Name: MSZoning, Length: 2919, dtype: object

- This looks a little different from a `DataFrame`
- Let's check the `type`

In [261]:
type(df['MSZoning'])

pandas.core.series.Series

- A `Series` is a 1-D **array** that has a **name** and an **index**
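
To see these parts separately, here is a small sketch that builds a `Series` by hand (the values are made up):

```python
import pandas as pd

# Hypothetical Series with three values
zoning = pd.Series(["RL", "RM", "RL"], name="MSZoning")

print(zoning.name)          # the name of the Series
print(list(zoning.index))   # the row labels
print(zoning.tolist())      # the values as a plain list
```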

<center>
    <img src="https://doit-test.readthedocs.io/en/latest/_images/base_01_pandas_5_0.png">
    </center>

- We can also obtain a `DataFrame` if we use a `list` of variables

In [264]:
list_of_variables = ['MSZoning']
df[list_of_variables].head()

Unnamed: 0,MSZoning
0,RL
1,RL
2,RL
3,RL
4,RL


- Using a `list` of **variables** also allows us to select **multiple columns**

In [266]:
list_of_variables = ['MSZoning', 'SalePrice']
df[list_of_variables].head()

Unnamed: 0,MSZoning,SalePrice
0,RL,208500.0
1,RL,181500.0
2,RL,223500.0
3,RL,140000.0
4,RL,250000.0
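
The difference between a single name and a `list` of names can be checked directly. A minimal sketch, assuming a two-row `toy_df` in place of the real data:

```python
import pandas as pd

# Hypothetical stand-in for the housing DataFrame
toy_df = pd.DataFrame({"MSZoning": ["RL", "RM"], "SalePrice": [208500.0, 181500.0]})

as_series = toy_df["MSZoning"]    # a column name gives a Series
as_frame = toy_df[["MSZoning"]]   # a list of column names gives a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```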


# 4. Introduction to Pandas
## 4.10. Exercise - Selecting columns with `Qual`
---

Checking our columns, we can see that we have several **variables** with the keyword **`Qual`**.


In [339]:
df.columns

Index(['Unnamed: 0', '1stFlrSF', '2ndFlrSF', '3SsnPorch', 'Alley',
       'BedroomAbvGr', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtFinType1', 'BsmtFinType2', 'BsmtFullBath',
       'BsmtHalfBath', 'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'Condition1',
       'Condition2', 'Electrical', 'EnclosedPorch', 'ExterCond', 'ExterQual',
       'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Fireplaces',
       'Foundation', 'FullBath', 'Functional', 'GarageArea', 'GarageCars',
       'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'GarageYrBlt',
       'GrLivArea', 'HalfBath', 'Heating', 'HeatingQC', 'HouseStyle', 'Id',
       'KitchenAbvGr', 'KitchenQual', 'LandContour', 'LandSlope', 'LotArea',
       'LotConfig', 'LotFrontage', 'LotShape', 'LowQualFinSF', 'MSSubClass',
       'MSZoning', 'MasVnrArea', 'MasVnrType', 'MiscFeature', 'MiscVal',
       'MoSold', 'Neighborhood', 'OpenPorchSF', 'OverallCond', 'OverallQual',
       'PavedDrive', '

1. Create an empty `list` called `helper_list`, where we will store the selected **columns**
2. Get all available features in our `pd.DataFrame`, using `pd.DataFrame.columns`
3. **Iterate** the columns from step 2, with a `for` loop
4. **If** the column has the word **`Qual`** in it, `list.append()` the column to the `helper_list`
5. Subset the `pd.DataFrame` with the selected features

In [340]:
helper_list = []
all_columns = df.columns
for col in all_columns:
    if 'Qual' in col:
        helper_list.append(col)
helper_list

['BsmtQual',
 'ExterQual',
 'GarageQual',
 'KitchenQual',
 'LowQualFinSF',
 'OverallQual']

In [341]:
df[helper_list]

Unnamed: 0,BsmtQual,ExterQual,GarageQual,KitchenQual,LowQualFinSF,OverallQual
0,Gd,Gd,TA,Gd,0,7
1,Gd,TA,TA,TA,0,6
2,Gd,Gd,TA,Gd,0,7
3,TA,TA,TA,Gd,0,7
4,Gd,Gd,TA,Gd,0,8
...,...,...,...,...,...,...
2914,TA,TA,,TA,0,4
2915,TA,TA,TA,TA,0,4
2916,TA,TA,TA,TA,0,5
2917,Gd,TA,,TA,0,5
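
The same selection can be written more compactly. The sketch below shows two possible alternatives on a made-up `toy_df`: a list comprehension, and `DataFrame.filter(like=...)`, which keeps the columns whose name contains the given text.

```python
import pandas as pd

# Hypothetical stand-in for the housing DataFrame
toy_df = pd.DataFrame({"BsmtQual": ["Gd"], "LotArea": [8450], "OverallQual": [7]})

# Option 1: list comprehension instead of the for loop with append
qual_columns = [col for col in toy_df.columns if "Qual" in col]
print(qual_columns)

# Option 2: let pandas filter the column names
filtered = toy_df.filter(like="Qual")
print(list(filtered.columns))
```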


# 4. Introduction to Pandas
## 4.11. Row Access
---

- We can also select the **rows** of a `DataFrame`
- It also uses square brackets `[ ]`, with a **slice** like the ones we use on a `list`
- This time around the **indexes** are just numbers (row positions)

In [287]:
df[:5]

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
0,0,856,854,0,,3,1Fam,TA,No,706.0,...,WD,0,Pave,8,856.0,AllPub,0,2003,2003,2008
1,1,1262,0,0,,3,1Fam,TA,Gd,978.0,...,WD,0,Pave,6,1262.0,AllPub,298,1976,1976,2007
2,2,920,866,0,,3,1Fam,TA,Mn,486.0,...,WD,0,Pave,6,920.0,AllPub,0,2001,2002,2008
3,3,961,756,0,,3,1Fam,Gd,No,216.0,...,WD,0,Pave,7,756.0,AllPub,0,1915,1970,2006
4,4,1145,1053,0,,4,1Fam,TA,Av,655.0,...,WD,0,Pave,9,1145.0,AllPub,192,2000,2000,2008
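
One detail worth knowing: only a slice inside `[ ]` selects rows. A single value is treated as a column name. A minimal sketch on a made-up `toy_df`:

```python
import pandas as pd

# Hypothetical stand-in for the housing DataFrame
toy_df = pd.DataFrame({"LotArea": [8450, 9600, 11250], "YrSold": [2008, 2007, 2008]})

first_two = toy_df[:2]   # a slice inside [] selects rows
print(len(first_two))

try:
    toy_df[0]            # a single value inside [] is looked up as a column name
    lookup_failed = False
except KeyError:
    lookup_failed = True
    print("KeyError: there is no column named 0")
```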


- We can check the `DataFrame` **index** by doing `df.index`

In [267]:
df.index

RangeIndex(start=0, stop=2919, step=1)

In [270]:
# TYPE CONVERSION
list(df.index)[:20]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

<center>
    <img src="https://doit-test.readthedocs.io/en/latest/_images/base_01_pandas_5_0.png">
    </center>

# 4. Introduction to Pandas
## 4.12. Row access with `pd.DataFrame.loc[]` and `pd.DataFrame.iloc[]`
---

- Let's recall what **Numpy** allowed us to do with `np.array`

In [275]:
np_weight, np_height

(array([60, 70, 80]), array([1.5, 1.6, 1.7]))

- We could do **complex subsetting** with `np.array`
- For example, getting the **weights** of people who are at least 1.6 meters tall

In [278]:
condition = np_height >= 1.6
print(condition)
print(np_weight)
np_weight[condition]

[False  True  True]
[60 70 80]


array([70, 80])

- Or we could also give a `list` of specific **indexes**

In [290]:
np_weight[[0, -1]]

array([60, 80])

- There are two tools that help us select data in a complex way
    - `loc[]` - allows us to select rows **by label** or **by condition** (a boolean mask)
    - `iloc[]` - allows us to select rows **by position** (**`i`** is for **integer** - integer location)
- Let's simulate the **Numpy** example with a `pd.DataFrame`

- We can get a **condition**, much in the same way as **Numpy** does

In [311]:
feature_description("Heating")

Heating: Type of heating
       Floor	Floor Furnace
       GasA	Gas forced warm air furnace
       GasW	Gas hot water or steam heat
       Grav	Gravity furnace	
       OthW	Hot water or steam heat other than gas
       Wall	Wall furnace


In [312]:
df["Heating"].head()

0    GasA
1    GasA
2    GasA
3    GasA
4    GasA
Name: Heating, dtype: object

In [314]:
condition = df["Heating"] == "Grav"
condition

0       False
1       False
2       False
3       False
4       False
        ...  
2914    False
2915    False
2916    False
2917    False
2918    False
Name: Heating, Length: 2919, dtype: bool

In [315]:
df.loc[condition]

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
155,155,572,524,0,,2,1Fam,TA,No,0.0,...,WD,0,Pave,5,572.0,AllPub,0,1924,1950,2008
514,514,789,0,0,,2,1Fam,TA,No,0.0,...,WD,0,Pave,5,768.0,AllPub,0,1926,1950,2007
636,636,800,0,0,,1,1Fam,Fa,No,0.0,...,ConLw,0,Pave,4,264.0,AllPub,0,1936,1950,2009
968,968,600,368,0,,2,1Fam,TA,No,0.0,...,WD,0,Pave,6,600.0,AllPub,0,1910,1950,2009
1144,1144,672,252,0,,2,2fmCon,TA,No,348.0,...,WD,0,Pave,5,672.0,AllPub,0,1941,1950,2010
1337,1337,693,0,0,Grvl,2,1Fam,TA,No,0.0,...,WD,0,Pave,4,693.0,AllPub,0,1941,1950,2006
1443,1443,952,0,0,,2,1Fam,TA,No,0.0,...,WD,40,Pave,4,952.0,AllPub,0,1916,1950,2009
1539,79,1128,1128,0,,4,2fmCon,TA,Mn,0.0,...,WD,0,Pave,12,840.0,AllPub,0,1910,1950,2010
2557,1097,1296,1296,0,,6,Duplex,TA,Mn,371.0,...,WD,0,Pave,12,1296.0,AllPub,0,1923,1950,2007
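
`loc[]` can also take a second argument to pick columns at the same time as rows. A minimal sketch, assuming a four-row `toy_df` with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the housing DataFrame
toy_df = pd.DataFrame({
    "Heating": ["GasA", "Grav", "GasA", "Grav"],
    "YearBuilt": [2003, 1924, 1976, 1926],
})

is_grav = toy_df["Heating"] == "Grav"   # boolean Series, one value per row
print(is_grav.sum())                    # True counts as 1, so this counts the matches

# Rows where the condition holds, and only the YearBuilt column
subset = toy_df.loc[is_grav, ["YearBuilt"]]
print(subset)
```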


- `pd.DataFrame.iloc[]` only does **position-based selection**
- e.g. the rows at positions 1 and 3 (the 2nd and 4th rows, since counting starts at 0)
- To select **multiple rows** we use a `list` of **positions**, as with **Numpy**

In [318]:
df.iloc[[1,3]]

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
1,1,1262,0,0,,3,1Fam,TA,Gd,978.0,...,WD,0,Pave,6,1262.0,AllPub,298,1976,1976,2007
3,3,961,756,0,,3,1Fam,Gd,No,216.0,...,WD,0,Pave,7,756.0,AllPub,0,1915,1970,2006
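
With the default index (0, 1, 2, ...) labels and positions coincide, so `loc[]` and `iloc[]` can look interchangeable. The sketch below uses a made-up `toy_df` whose labels differ from the positions to show the difference.

```python
import pandas as pd

# Hypothetical frame: index labels 10, 20, 30 sit at positions 0, 1, 2
toy_df = pd.DataFrame({"SalePrice": [100, 200, 300]}, index=[10, 20, 30])

by_label = toy_df.loc[20, "SalePrice"]   # row with label 20
by_position = toy_df.iloc[2, 0]          # row at position 2, column at position 0

print(by_label)
print(by_position)
```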


# 4. Introduction to Pandas
## 4.13. Exercise - using `pd.DataFrame.loc[]`
---

Select the `pd.DataFrame` rows for which:
1. The house `SalePrice` is above `500000`
2. The house was built after 2000 (column named `YearBuilt`)

**HINT**: Remember that we can get the intersection of two conditions using `&`

In [343]:
print(True & False)
cond1 = np.array([True, False])
cond2 = np.array([False, True])
print(cond1 & cond2)

False
[False False]


- **NOTE**: In this case `and` does not work, because it needs a single `True`/`False` value and not an array of them

In [344]:
cond1 and cond2

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
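
When the comparisons are written inline, each one needs its own parentheses, because `&` binds more tightly than `>` or `<`. A minimal sketch with made-up arrays, also showing `|` (or) and `~` (not):

```python
import numpy as np

# Hypothetical prices and construction years for three houses
prices = np.array([150000, 250000, 600000])
years = np.array([1995, 2005, 2008])

both = (prices > 200000) & (years > 2000)     # element-wise "and"
either = (prices > 200000) | (years > 2000)   # element-wise "or"
neither = ~either                             # element-wise "not"

print(both)
print(neither)
```

Without the parentheses, `prices > 200000 & years > 2000` evaluates `200000 & years` first and then fails with the same `ValueError` as above.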

In [324]:
feature_description("YearBuilt")

YearBuilt: Original construction date


In [336]:
condition1 = df['SalePrice'] > 500000
condition2 = df['YearBuilt'] > 2000
df.loc[condition1 & condition2]

Unnamed: 0.1,Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,...,SaleType,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold
178,178,2234,0,0,,1,1Fam,TA,No,1904.0,...,New,0,Pave,9,2216.0,AllPub,0,2008,2009,2009
440,440,2402,0,0,,2,1Fam,TA,Gd,1767.0,...,WD,170,Pave,10,3094.0,AllPub,0,2008,2008,2009
769,769,1690,1589,0,,4,1Fam,TA,Gd,1416.0,...,WD,210,Pave,12,1650.0,AllPub,503,2003,2003,2010
803,803,1734,1088,0,,4,1Fam,TA,Gd,0.0,...,New,192,Pave,12,1734.0,AllPub,52,2008,2009,2009
898,898,2364,0,0,,2,1Fam,TA,Gd,2188.0,...,New,0,Pave,11,2330.0,AllPub,0,2009,2010,2010
1046,1046,1992,876,0,,4,1Fam,TA,Av,240.0,...,New,0,Pave,11,1992.0,AllPub,214,2005,2006,2006


# 4. Introduction to Pandas
## 4.14. Summary
---