# Lesson 3 - Gathering and Cleaning Data

![pandas](https://miro.medium.com/max/791/1*e7lYKpF5FJYjNMVPlQgaKg.png)  
**Source**: https://analyticsindiamag.com/

## Outline for this Notebook

1. Recap of Module 2
2. Learning Outcomes for the Module
3. Introduction to pandas  
4. Series
5. DataFrames
6. Indexing - Accessing and Selecting Data with pandas

# 1. Recap of Module 2

In the last module we covered

1. List methods, and how Python lists can be treated as arrays and matrices that hold our data in the same way rows and columns do in Excel. Lists are extremely powerful and versatile data structures, and they can be used in almost every aspect of the data analytics cycle.
- **In Python**

```python
my_array = [10, 20, 30, 40, 50, 60, 70, 80, 90]
```

- **In Spreadsheets**

|  | A |
|:--------:|:--------:|
| 1 | my_array |
| 2 | 10 |
| 3 | 20 |
| 4 | 30 |
| 5 | 40 |
| 6 | 50 |
| 7 | 60 |
| 8 | 70 |
| 9 | 80 |
| 10 | 90 |


2. NumPy is a library for fast computations on arrays and matrices, and it is built on top of the C and Fortran programming languages.
3. Where possible, we can take advantage of broadcasting instead of using loops to apply an operation to every element in an array.
- **With Python Lists**

```python
lots_of_nums_list = [1, 2, 3, 4]  # sample data for illustration
a_list = []

for num in lots_of_nums_list:
    a_list.append(num + 5)
```

- **With NumPy**

```python
new_array = nums_numpy_array + 5
```

4. Generating random (or otherwise synthetic) data allows us to test models and functions, and NumPy has many functions for this, such as `np.random.random` for random values, and `np.ones` and `np.linspace` for fixed patterns.
5. Masking is a type of filtering method that allows us to slice and dice the data given a condition or a set of them. It is, in a way, similar to constructing if-else statements with regular Python code.

```python
import numpy as np

# in the thousands
habitants_per_state = np.array([1000, 700, 1100, 500, 300, 450, 640])

high_population_mask = (habitants_per_state > 600)

habitants_per_state[high_population_mask] # returns array([1000,  700, 1100,  640])
```

6. List comprehensions are a type of for loop that gives us the ability to generate a list from repeated instructions. The main difference between loops and list comprehensions is that in the former the action takes place after the for loop is defined, while in the latter the action is written first, before the `for` clause.

```python
## for loops

some_list = [1, 2, 3]  # sample data for illustration
a_new_list = []

for a_element in some_list:
    a_new_list.append(a_element + 2)
    
## lists comprehensions
a_new_list = [a_element + 2 for a_element in some_list]
```

# 2. Learning Outcomes

In this module you will:

1. Learn how to create and load datasets to Python using the pandas library
2. Learn how to manipulate different datasets
3. Learn how to clean and prepare data for analysis
4. Understand why data preparation is one of the most important steps in the data analytics cycle

# 3. Introduction to pandas

![pandas](https://i.redd.it/c6h7rok9c2v31.jpg)  
**Source**: https://pandas.pydata.org/

[pandas](https://pandas.pydata.org/) is a Python library originally developed with the goal of making data manipulation and analysis in Python easier. The library was created by Wes McKinney and collaborators, and it was first released as an open source library in 2010. It has been designed to work (very well) with tabular data. In essence, pandas gives you, in a way, the same capabilities you would get when working with data in tools such as Microsoft Excel or Google Spreadsheets, but with the added benefit of allowing you to use and manipulate more data.

The pandas library is also built on top of NumPy, which means that a lot of the functionality you learned in the previous module will transfer seamlessly to this lesson and to the new tool we are about to explore. What you will find in pandas is the ability to control your NumPy arrays as if you were viewing them in a spreadsheet.

Some of pandas' main characteristics are:

- Straightforward and convenient way of loading and saving datasets from and into different formats
- Swiss army knife for data cleaning
- Provides the same broadcasting operations as NumPy, so, where possible, avoid loops
- Allows for different data types and structures inside its two main data structures, Series and DataFrames
- Provides functionalities for visualising your data

pandas, like NumPy, also has an industry-standard alias that we will be using throughout the course. This library is usually imported as `pd`.

```python
import pandas as pd
```

Just as NumPy has a very efficient data structure called the `ndarray`, pandas, as NumPy's child, has its own data structures: the `DataFrame`, which is the equivalent of a NumPy matrix with many more functionalities, and the `Series`, which is the equivalent of a one-dimensional NumPy array. We will cover these two structures next.

**Warning:** It is possible that the control boost you will feel as you begin to learn how to use pandas to clean, manipulate, and analyse data, will prevent you from going back to using the tools you have been using in the past (e.g. Excel, Google Sheets, regular calculators, etc.). 😎

## DataFrames and Series in pandas

![pandas](https://media.giphy.com/media/txsJLp7Z8zAic/giphy.gif)

Before we are able to import data into Python from outside sources, we'll walk through how to transform existing data (i.e., data we will come up with) into the two main data structures of pandas, `DataFrame`s and `Series`. We will do so through several different avenues, so let's first talk about what `DataFrame`s and `Series` are.

A `DataFrame` is a data structure particular to pandas that allows us to work with data in a tabular format. You can also think of a pandas `DataFrame` as a NumPy matrix but with (to some extent and depending on the user) more flexibility. Some characteristics of `DataFrame`s are:

- they have a two-dimensional matrix-like shape by default but can also handle more dimensions (e.g. with a multilevel index)
- their rows and columns are clearly defined with a visible index for the rows and names (or numbers) for the columns
- pivot tables, which are one of the main tools of spreadsheets, are also available in pandas
- lots of functionalities for reshaping, cleaning, and munging the data
- indexes can be strings, dates, numbers, etc.
- very powerful and flexible `.groupby()` operation

A pandas `Series` is the equivalent of a column in a pandas `DataFrame`, a one-dimensional NumPy array, or a column in Excel. In fact, since pandas derives most of its functionalities from NumPy, you can transform a `Series` (and also a `DataFrame`) back into a NumPy array (or matrix) by accessing its `.values` attribute. A Series has most of the functionalities you will see in a `DataFrame`, and several Series can be concatenated to form a complete `DataFrame` as well.

Let's first start by importing `pandas` with its industry alias, `pd`, and then check the version we have installed.

**Note:** At the time of writing, the latest version of pandas is 1.1.4.

In [3]:
import pandas as pd
import numpy as np

In [3]:
!pip install --upgrade pandas

Requirement already up-to-date: pandas in c:\users\monch.mercader\anaconda3\lib\site-packages (1.1.4)


In [4]:
pd.__version__

'1.1.4'

# 4. Series

Let's create some fake data first and turn it into a pandas `Series`. We will do so in the following ways:
- with lists
- with NumPy arrays
- and with a dictionary containing lists and tuples

Say we have data for a large order of pizzas we purchased a while back for a friends' gathering. We ordered the pizzas from different stores and now we would like to have a look at the quantities we ordered using a pandas `Series` and assign it to a variable for later use. Let's first see what the data looks like in a list.

In [5]:
# This will be your fake pizza data representing amount of pizzas purchased
[2, 1, 6, 5, 1, 4, 2, 6, 2, 1]

[2, 1, 6, 5, 1, 4, 2, 6, 2, 1]

To create a pandas Series we call the `pd.Series(data=, name=)` constructor, pass our data through the `data=` parameter, and give it a name using the `name=` parameter (which is optional).

In [8]:
# This will be your first pandas Series
first_series = pd.Series(data=[2, 1, 6, 5, 1, 4, 2, 6, 2, 1], name='pizzas')
first_series

0    2
1    1
2    6
3    5
4    1
5    4
6    2
7    6
8    2
9    1
Name: pizzas, dtype: int64

Notice that when we visualize a pandas Series we can immediately see the index next to the array.

We can also use NumPy arrays for the data we pass into our Series. As noted earlier, while using pandas we are essentially using NumPy data structures under the hood.

In [9]:
second_series = pd.Series(data=np.arange(20, 30))
second_series

0    20
1    21
2    22
3    23
4    24
5    25
6    26
7    27
8    28
9    29
dtype: int32

Another neat functionality of Series is that they are not bound to only having a numerical index, unlike lists and NumPy arrays. Let's look at an example where we add our own index to a pandas Series. Note that indexes can be strings and dates as well.

In [10]:
third_series = pd.Series(data=np.arange(8), 
                          index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], 
                          name='random_data')
third_series

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
Name: random_data, dtype: int32
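Since indexes can also be dates, here is a minimal sketch of a Series with a date index. The `visits` name and its values are made up for illustration, and `pd.date_range` is just one convenient way of building such an index.

```python
import pandas as pd

# hypothetical data: daily visit counts for four consecutive days
date_index = pd.date_range(start='2020-01-01', periods=4, freq='D')
visits = pd.Series(data=[12, 7, 9, 15], index=date_index, name='visits')

# select one element by its date label, just like we would with a string label
third_day = visits.loc[pd.Timestamp('2020-01-03')]
print(visits)
print(third_day)
```

The label-based selection works the same way as with the letter index above; only the type of the labels has changed.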

In the previous module, we spent quite some time on NumPy because it is a great segue into pandas: a lot of the methods, and the slicing and dicing techniques you've already learned, are applicable to pandas data structures as well. For example, broadcasting an operation over an entire array instead of using a loop works just as well with a pandas Series.

In [11]:
third_series['d']

3

In [12]:
# add 5 to every element in our second_series
second_series + 5

0    25
1    26
2    27
3    28
4    29
5    30
6    31
7    32
8    33
9    34
dtype: int32

In [13]:
first_series

0    2
1    1
2    6
3    5
4    1
5    4
6    2
7    6
8    2
9    1
Name: pizzas, dtype: int64

In [14]:
# raise every element to the power of 3
first_series ** 3

0      8
1      1
2    216
3    125
4      1
5     64
6      8
7    216
8      8
9      1
Name: pizzas, dtype: int64

Keep in mind, though, that just like with NumPy arrays, when we broadcast an operation on a pandas object the change does not happen in place, so we have to assign the result to a new variable, or back to the original one, to keep the changes in memory.

In [15]:
# the Series did not keep the changes
print(first_series)

0    2
1    1
2    6
3    5
4    1
5    4
6    2
7    6
8    2
9    1
Name: pizzas, dtype: int64


In [27]:
# now the Series will keep the changes
first_series = first_series ** 3
print(first_series)

0      8
1      1
2    216
3    125
4      1
5     64
6      8
7    216
8      8
9      1
Name: pizzas, dtype: int64


It is worth mentioning again that if you would like to access the NumPy `ndarray` data structure underneath a pandas Series, you can do so by calling the attribute `.values` on the Series. For example:

In [28]:
first_series.values

array([  8,   1, 216, 125,   1,  64,   8, 216,   8,   1], dtype=int64)

In [29]:
type(first_series.values)

numpy.ndarray
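As a small aside, pandas also provides the `.to_numpy()` method, which the pandas documentation recommends over `.values` when you need a NumPy array. The sketch below uses a short made-up Series (`pizzas_cubed` is a hypothetical name) rather than the full one from above.

```python
import numpy as np
import pandas as pd

# made-up sample values, similar to the cubed pizza quantities above
pizzas_cubed = pd.Series([8, 1, 216, 125], name='pizzas')

# .to_numpy() returns the data as a NumPy ndarray, much like .values does
as_array = pizzas_cubed.to_numpy()
print(type(as_array))
print(as_array)
```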

We can also use a dictionary to create our pandas Series. The only caveat is that, since the values in a dictionary can themselves hold a lot of data, we have to explicitly select the data we want in the rows by using the name of its key. If we do not select a key, pandas assigns the keys to the index of the Series and each key's whole value to the corresponding element. The result won't be any better than using the regular dictionary itself unless, of course, there is a need for this kind of structure.

Let's look at an example.

In [30]:
first_series

0      8
1      1
2    216
3    125
4      1
5     64
6      8
7    216
8      8
9      1
Name: pizzas, dtype: int64

In [32]:
pizzas = {'pizzas': [2, 1, 6, 5, 1, 4, 2, 6, 2, 1],
         'pizzas_dos': np.random.randint(2, 20, 10)}
# Not good, one index only
fourth_series = pd.Series(data=pizzas)
fourth_series

pizzas                 [2, 1, 6, 5, 1, 4, 2, 6, 2, 1]
pizzas_dos    [14, 19, 11, 5, 13, 13, 16, 17, 19, 17]
dtype: object

Notice that what we got back was a 2-element pandas Series where the keys `pizzas` and `pizzas_dos` are now the index, and each group of pizza amounts is represented, still as a whole list or array, as a single element. To fix this, let's explicitly select the values of one key, `pizzas_dos`.

In [34]:
# good example
fourth_series = pd.Series(data=pizzas['pizzas_dos'])
fourth_series

0    14
1    19
2    11
3     5
4    13
5    13
6    16
7    17
8    19
9    17
dtype: int32

In some instances we might want to use the default behavior. For example, when we have one key mapping to one single element in a dictionary.

In [37]:
states_city = {
    'NSW':'Sydney',
    'VIC':'Melbourne',
    'SA':'Adelaide',
    'TAS':'Hobart',
    'WA':'Perth',
    'QLD':'Brisbane',
    'NT':'Darwin',
    'ACT':'Canberra',
}

In [38]:
# this is a nice example
sc_series = pd.Series(states_city)
sc_series

NSW       Sydney
VIC    Melbourne
SA      Adelaide
TAS       Hobart
WA         Perth
QLD     Brisbane
NT        Darwin
ACT     Canberra
dtype: object

Tuples work in the same way as lists when we pass them into a pandas Series, but be careful with sets. Since sets are moody and don't like order, pandas cannot line their elements up with an index and is thus unable to build Series or DataFrames from them.

In [40]:
len(list('abcdefg'))

7

In [42]:
# this works well
some_tuple = (40, 3, 2, 10, 31, 29, 74)
pd.Series(some_tuple, index=list('abcdefg'))

a    40
b     3
c     2
d    10
e    31
f    29
g    74
dtype: int64

In [43]:
# this does not work
some_set = {40, 3, 2, 10, 31, 29, 74}
pd.Series(some_set, index=list('abcdefg'))

TypeError: 'set' type is unordered
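If the data does arrive as a set, one possible workaround (a sketch, not the only option) is to turn it into an ordered collection first, for example with `sorted()`, which returns a list. The variable names `ordered_values` and `set_series` are made up for this example.

```python
import pandas as pd

some_set = {40, 3, 2, 10, 31, 29, 74}

# sorted() returns a list, so the elements now have a well-defined order
ordered_values = sorted(some_set)

# a list can be matched with an index, so this works
set_series = pd.Series(ordered_values, index=list('abcdefg'))
print(set_series)
```

Keep in mind that the order is now the sorted order, not the order in which the elements were typed into the set.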

## Exercise 1

1. Create a pandas Series of 100 elements using a linearly spaced array from 50 to 70. Call it `my_first_series` and print the results.
2. Multiply the array by 20 and assign it to a new variable called `my_first_broadcast`. Print the results.

In [50]:
my_first_series = pd.Series(np.linspace(50,70,num=100))
my_first_series

0     50.000000
1     50.202020
2     50.404040
3     50.606061
4     50.808081
        ...    
95    69.191919
96    69.393939
97    69.595960
98    69.797980
99    70.000000
Length: 100, dtype: float64

In [53]:
my_first_broadcast = my_first_series*20
print(my_first_broadcast)

0     1000.000000
1     1004.040404
2     1008.080808
3     1012.121212
4     1016.161616
         ...     
95    1383.838384
96    1387.878788
97    1391.919192
98    1395.959596
99    1400.000000
Length: 100, dtype: float64


## Exercise 2

1. Create a pandas Series from an array of integers from 100 to 500 in steps of 50, and with an index made up of letters. Call it `first_cool_index` and print it.
2. Do a floor division by 11 on the entire array, assign the result a variable called `low_div` and print the result.

In [64]:
first_cool_index = pd.Series(np.arange(100,500,50), index=list('leonardo'))
first_cool_index

l    100
e    150
o    200
n    250
a    300
r    350
d    400
o    450
dtype: int32

In [65]:
first_cool_index['o']

o    200
o    450
dtype: int32
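The second part of the exercise is not shown above. One possible way to do it, offered as a sketch rather than as the official solution, is to broadcast the floor division operator `//` over the whole Series. The Series is rebuilt here so that the cell runs on its own.

```python
import numpy as np
import pandas as pd

# same Series as in the first part of the exercise
first_cool_index = pd.Series(np.arange(100, 500, 50), index=list('leonardo'))

# floor division is broadcast to every element, the index is kept as it is
low_div = first_cool_index // 11
print(low_div)
```

Notice also that the letter `o` appears twice in the index, which is why selecting `first_cool_index['o']` above returned a Series of two elements instead of a single number.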

## Exercise 3

1. Create a pandas Series of 7 elements from a dictionary where the keys are a sport and the value is a famous player in that sport. Call the Series `sports_players` and print it.

In [71]:
sports = {
    'Football':'Ronaldo',
    'Golf':'Dustin Johnson',
    'Basketball':'Michael Jordan',
    'F1':'Daniel Riccardo',
    'Handball':'Mikket Hansen',
    'Bowling':'Jason Belmonte',
    'Baseball':'Fernando Tatis Jr'
}
sports
sports_players = pd.Series(sports)
sports_players

Football                Ronaldo
Golf             Dustin Johnson
Basketball       Michael Jordan
F1              Daniel Riccardo
Handball          Mikket Hansen
Bowling          Jason Belmonte
Baseball      Fernando Tatis Jr
dtype: object

In [72]:
sports_players

Football                Ronaldo
Golf             Dustin Johnson
Basketball       Michael Jordan
F1              Daniel Riccardo
Handball          Mikket Hansen
Bowling          Jason Belmonte
Baseball      Fernando Tatis Jr
dtype: object

# 5. DataFrames

![dataframe](pictures/dataframes.png)

You can think of a pandas DataFrame as a collection of Series with the difference being that all of the values in those Series will share the same index once they are in the DataFrame.

Another distinction between the two is that you can have a DataFrame of only one column, but you cannot have a Series of more than one (or at least you shouldn't since that is what the DataFrame is for).
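To make the distinction concrete, here is a small sketch that turns a Series into a one-column DataFrame with the `.to_frame()` method. The data is made up, and `one_column_frame` is a hypothetical name used only for this example.

```python
import pandas as pd

# a Series is one-dimensional
pizzas_series = pd.Series([2, 1, 6], name='pizzas')

# .to_frame() wraps the Series into a DataFrame with a single column
one_column_frame = pizzas_series.to_frame()

print(type(pizzas_series), pizzas_series.ndim)
print(type(one_column_frame), one_column_frame.shape)
```

The values and the index are the same in both objects. What changes is that the DataFrame is two-dimensional and could take more columns later on.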

Let's now create some fake data and reshape it into a pandas `DataFrame` object. We will do so in the following ways:
- a dictionary object with lists and tuples
- lists and/or tuples
- NumPy arrays
- multiple pandas Series

One of the fastest and most common ways to construct a DataFrame is to pass a Python dictionary to the `data=` parameter of the `pd.DataFrame()` constructor. Doing this with dictionaries saves us from having to name each of the columns in our DataFrame separately.

In [73]:
# Create a dictionary of fake pizza data
data_le_pizza = {
    'pizzas': [2, 1, 6, 5, 1, 4, 2, 6, 2, 1], # some fake pizzas purchased
    'price_pizza': (20, 16, 18, 21, 22, 27, 30, 21, 22, 17), # some fake prices per pizza 
    'pizzeria_location': ['Sydney', 'Sydney', 'Seville', 'Perth', 'Perth', 'Melbourne',
                          'Sydney', 'Seville', 'Melbourne', 'Perth']
}

data_le_pizza

{'pizzas': [2, 1, 6, 5, 1, 4, 2, 6, 2, 1],
 'price_pizza': (20, 16, 18, 21, 22, 27, 30, 21, 22, 17),
 'pizzeria_location': ['Sydney',
  'Sydney',
  'Seville',
  'Perth',
  'Perth',
  'Melbourne',
  'Sydney',
  'Seville',
  'Melbourne',
  'Perth']}

In [74]:
# Check the data in the dictionary
data_le_pizza['pizzas']

[2, 1, 6, 5, 1, 4, 2, 6, 2, 1]

In [75]:
# remember the get method
data_le_pizza.get('price_pizza', 'not here')

(20, 16, 18, 21, 22, 27, 30, 21, 22, 17)

In [76]:
# we ordered international pizza, literally : )
data_le_pizza['pizzeria_location']

['Sydney',
 'Sydney',
 'Seville',
 'Perth',
 'Perth',
 'Melbourne',
 'Sydney',
 'Seville',
 'Melbourne',
 'Perth']

In [77]:
# we can pass in the dictionary as it is
df_la_pizza = pd.DataFrame(data=data_le_pizza)
df_la_pizza.head()

|  | pizzas | price_pizza | pizzeria_location |
|:--------:|:--------:|:--------:|:--------:|
| 0 | 2 | 20 | Sydney |
| 1 | 1 | 16 | Sydney |
| 2 | 6 | 18 | Seville |
| 3 | 5 | 21 | Perth |
| 4 | 1 | 22 | Perth |


Notice how our new object, the pandas DataFrame, resembles the way we would see data in a spreadsheet. In addition, the keys of our dictionary map perfectly to the dataframe column names and the values to, well, their respective columns.

You can access the data inside your new DataFrame by calling the names of your columns as attributes (like a method without the round brackets or parentheses) or as a key in a dictionary. Note though that you can only access the columns of a dataframe as if it were an attribute if the name of the column has no spaces in it, otherwise, it can only be accessed as a key inside a dictionary.

In [78]:
# access the pizzas column as an attribute
df_la_pizza.pizzas

0    2
1    1
2    6
3    5
4    1
5    4
6    2
7    6
8    2
9    1
Name: pizzas, dtype: int64

In [79]:
# access the pizzas variable as the key in a dictionary
df_la_pizza['pizzas']

0    2
1    1
2    6
3    5
4    1
5    4
6    2
7    6
8    2
9    1
Name: pizzas, dtype: int64
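The sketch below illustrates why the dictionary-style access is the safer habit. The DataFrame and its column names are made up: one name contains a space, and the other one, `count`, happens to match the name of an existing DataFrame method.

```python
import pandas as pd

# hypothetical columns: one name with a space, one that matches a DataFrame method
df = pd.DataFrame({'pizzeria location': ['Sydney', 'Perth'],
                   'count': [2, 5]})

# key access works for any column name
locations = df['pizzeria location']
counts = df['count']

# attribute access finds the DataFrame.count method, not the 'count' column
print(callable(df.count))
print(counts.tolist())
```

In other words, besides names with spaces, attribute access also fails to reach columns whose names clash with something a DataFrame already has.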

You can broadcast operations to an entire column the same way you did with the Series in this lesson and the NumPy `ndarray`s in the previous module.

In [81]:
df_la_pizza['pizzas'] + 2

0    4
1    3
2    8
3    7
4    3
5    6
6    4
7    8
8    4
9    3
Name: pizzas, dtype: int64

You can also broadcast an operation over an entire DataFrame or a subsection of it, although this might not be possible or desirable when the columns contain different data types. For example, the first cell below gives an error because one of the columns holds strings, while the second one, which selects only the numerical columns, does not.

In [82]:
df_la_pizza + 2

TypeError: can only concatenate str (not "int") to str

In [83]:
df_la_pizza[['pizzas', 'price_pizza']] + 2

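|  | pizzas | price_pizza |
|:--------:|:--------:|:--------:|
| 0 | 4 | 22 |
| 1 | 3 | 18 |
| 2 | 8 | 20 |
| 3 | 7 | 23 |
| 4 | 3 | 24 |
| 5 | 6 | 29 |
| 6 | 4 | 32 |
| 7 | 8 | 23 |
| 8 | 4 | 24 |
| 9 | 3 | 19 |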
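When there are many columns, typing out the numerical ones by hand gets tedious. A possible alternative, shown here as a sketch on a small made-up DataFrame, is to let pandas pick them with `.select_dtypes()`.

```python
import pandas as pd

# small made-up version of the pizza data
df = pd.DataFrame({'pizzas': [2, 1, 6],
                   'price_pizza': [20, 16, 18],
                   'pizzeria_location': ['Sydney', 'Sydney', 'Seville']})

# keep only the columns with a numerical data type
numeric_part = df.select_dtypes(include='number')

# broadcasting now works because no string column is left
numeric_plus_two = numeric_part + 2
print(numeric_plus_two)
```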


DataFrames have several useful attributes, such as `.index` and `.columns`, that allow us to retrieve these pieces of information from the dataframe.

In [84]:
# shows us the start, stop, and step of our DataFrame's index, a.k.a. the range of the index
df_la_pizza.index

RangeIndex(start=0, stop=10, step=1)

In [85]:
# shows the names of the columns we have in our DataFrame
df_la_pizza.columns

Index(['pizzas', 'price_pizza', 'pizzeria_location'], dtype='object')

We can also add new columns by passing the name of the new column as a key, just like in a dictionary, and putting the corresponding values (or an operation that produces them) after the equals sign. This kind of assignment is identical to the one used when creating new variables.

In [94]:
df_la_pizza['new_pizzas'] = df_la_pizza['pizzas'] * 3.5
df_la_pizza

|  | pizzas | price_pizza | pizzeria_location | new_pizzas |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| 0 | 2 | 20 | Sydney | 7.0 |
| 1 | 1 | 16 | Sydney | 3.5 |
| 2 | 6 | 18 | Seville | 21.0 |
| 3 | 5 | 21 | Perth | 17.5 |
| 4 | 1 | 22 | Perth | 3.5 |
| 5 | 4 | 27 | Melbourne | 14.0 |
| 6 | 2 | 30 | Sydney | 7.0 |
| 7 | 6 | 21 | Seville | 21.0 |
| 8 | 2 | 22 | Melbourne | 7.0 |
| 9 | 1 | 17 | Perth | 3.5 |


pandas also gives us the option of naming the set of columns we have as well as the index column of our DataFrame. We can do this by calling the sub-attribute `.name` on the `.columns` and `.index` attributes of our DataFrame. Let's name our columns array, `pizza_attr`, for pizza attributes, and let's name our index array, `numbers`, to see this functionality of pandas in action.

In [95]:
df_la_pizza.columns.name = 'pizza_attr'
df_la_pizza.index.name = 'numbers'
df_la_pizza

| pizza_attr | pizzas | price_pizza | pizzeria_location | new_pizzas |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| **numbers** |  |  |  |  |
| 0 | 2 | 20 | Sydney | 7.0 |
| 1 | 1 | 16 | Sydney | 3.5 |
| 2 | 6 | 18 | Seville | 21.0 |
| 3 | 5 | 21 | Perth | 17.5 |
| 4 | 1 | 22 | Perth | 3.5 |
| 5 | 4 | 27 | Melbourne | 14.0 |
| 6 | 2 | 30 | Sydney | 7.0 |
| 7 | 6 | 21 | Seville | 21.0 |
| 8 | 2 | 22 | Melbourne | 7.0 |
| 9 | 1 | 17 | Perth | 3.5 |


Notice how the name assignment happened in place and now our DataFrame displays even more information than before. In addition to giving the set of columns a name, we can also rename a particular column or columns with the `.rename()` method, which takes in a dictionary with the old column name as the key and the new one as the value.

In [96]:
df_la_pizza.rename(mapper={'new_pizzas': 'New Pizzas'}, axis=1, inplace=True)
df_la_pizza

| pizza_attr | pizzas | price_pizza | pizzeria_location | New Pizzas |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| **numbers** |  |  |  |  |
| 0 | 2 | 20 | Sydney | 7.0 |
| 1 | 1 | 16 | Sydney | 3.5 |
| 2 | 6 | 18 | Seville | 21.0 |
| 3 | 5 | 21 | Perth | 17.5 |
| 4 | 1 | 22 | Perth | 3.5 |
| 5 | 4 | 27 | Melbourne | 14.0 |
| 6 | 2 | 30 | Sydney | 7.0 |
| 7 | 6 | 21 | Seville | 21.0 |
| 8 | 2 | 22 | Melbourne | 7.0 |
| 9 | 1 | 17 | Perth | 3.5 |


Let's unpack what just happened.
- the `mapper=` argument takes in a dictionary with the old column name as the key and new column name as the value
- the `axis=` argument set to one indicates that we want the change to happen in the columns. The other option, which is the default as well, applies to the rows
- the `inplace=True` argument tells pandas to keep the changes in the dataframe so that we don't have to reassign the dataframe back to its original variable
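For comparison, here is a sketch of the same kind of renaming without `inplace=True`, using the `columns=` keyword of `.rename()` instead of `mapper=` plus `axis=1`. The small DataFrame is made up for the example, and `renamed` is a hypothetical variable name.

```python
import pandas as pd

# small made-up DataFrame with the column we want to rename
df = pd.DataFrame({'pizzas': [2, 1], 'new_pizzas': [7.0, 3.5]})

# columns= targets the column names directly, so no axis argument is needed
renamed = df.rename(columns={'new_pizzas': 'New Pizzas'})

print(renamed.columns.tolist())  # the returned DataFrame has the new name
print(df.columns.tolist())       # the original DataFrame is unchanged
```

Without `inplace=True` the method returns a new DataFrame, so the result has to be assigned to a variable to be kept.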

If we wanted to get rid of a column we don't need or want anymore, we can use Python's `del` statement, just like we saw in the chapter on lists, arrays, and matrices in lesson 2.

For illustration purposes, let's delete the `New Pizzas` column we just renamed.

In [97]:
del df_la_pizza['New Pizzas']
df_la_pizza # notice that the column is now gone

| pizza_attr | pizzas | price_pizza | pizzeria_location |
|:--------:|:--------:|:--------:|:--------:|
| **numbers** |  |  |  |
| 0 | 2 | 20 | Sydney |
| 1 | 1 | 16 | Sydney |
| 2 | 6 | 18 | Seville |
| 3 | 5 | 21 | Perth |
| 4 | 1 | 22 | Perth |
| 5 | 4 | 27 | Melbourne |
| 6 | 2 | 30 | Sydney |
| 7 | 6 | 21 | Seville |
| 8 | 2 | 22 | Melbourne |
| 9 | 1 | 17 | Perth |
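pandas also has its own way of removing columns, the `.drop()` method. The sketch below uses a small made-up DataFrame. Unlike `del`, `.drop()` returns a new DataFrame and leaves the original untouched unless we reassign it.

```python
import pandas as pd

# small made-up DataFrame with a column we no longer need
df = pd.DataFrame({'pizzas': [2, 1],
                   'price_pizza': [20, 16],
                   'New Pizzas': [7.0, 3.5]})

# drop returns a copy without the listed columns
without_new = df.drop(columns=['New Pizzas'])

print(without_new.columns.tolist())
print(df.columns.tolist())  # the original still has all three columns
```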


In [100]:
# use a NumPy array rather than a tuple so that .reshape() is available
an_array = np.array([23, 42, 56, 82, 90, 10])
np.sum(an_array)

303

In [103]:
a_matrix = an_array.reshape(3, 2)
a_matrix

array([[23, 42],
       [56, 82],
       [90, 10]])

In [None]:
np.sum(a_matrix, axis=1)

In [104]:
df_la_pizza['pizzas']

numbers
0    2
1    1
2    6
3    5
4    1
5    4
6    2
7    6
8    2
9    1
Name: pizzas, dtype: int64

In [106]:
list_of_cols = ['pizzas', 'price_pizza']
two_cols = df_la_pizza[list_of_cols]  # select both numerical columns
two_cols.sum(axis=0)

pizza_attr
pizzas          30
price_pizza    214
dtype: int64

Let us look at how to convert a list of lists and tuples into a pandas DataFrame. We will first create a list called `la_pizzas` with lists and tuples, and then pass this matrix into our DataFrame constructor.

In [107]:
la_pizzas = [[2, 20, 'Sydney'],
            [1, 16, 'Sydney'],
            (6, 18, 'Seville'),
            [5, 21, 'Perth'],
            [1, 22, 'Perth'],
            (4, 27, 'Melbourne'),
            [2, 30, 'Sydney'],
            (6, 21, 'Seville'),
            [2, 22, 'Melbourne'],
            [1, 17, 'Perth']]
la_pizzas

[[2, 20, 'Sydney'],
 [1, 16, 'Sydney'],
 (6, 18, 'Seville'),
 [5, 21, 'Perth'],
 [1, 22, 'Perth'],
 (4, 27, 'Melbourne'),
 [2, 30, 'Sydney'],
 (6, 21, 'Seville'),
 [2, 22, 'Melbourne'],
 [1, 17, 'Perth']]

In [108]:
df_one = pd.DataFrame(data=la_pizzas, 
                      columns=['pizzas', 'price_pizza', 'pizzeria_location'])
df_one

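|  | pizzas | price_pizza | pizzeria_location |
|:--------:|:--------:|:--------:|:--------:|
| 0 | 2 | 20 | Sydney |
| 1 | 1 | 16 | Sydney |
| 2 | 6 | 18 | Seville |
| 3 | 5 | 21 | Perth |
| 4 | 1 | 22 | Perth |
| 5 | 4 | 27 | Melbourne |
| 6 | 2 | 30 | Sydney |
| 7 | 6 | 21 | Seville |
| 8 | 2 | 22 | Melbourne |
| 9 | 1 | 17 | Perth |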


As you can see, because our data had no column names this time, we had to use the `columns=` argument with a list of strings for the names we would like our columns to have. Otherwise, pandas would have numbered the columns, and you can imagine how difficult it would be to figure out the content of an unnamed column in a lengthy dataset.
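To see what the numbered columns look like, here is a short sketch with two made-up rows and no `columns=` argument.

```python
import pandas as pd

# two made-up rows, one list and one tuple, with no column names given
unnamed = pd.DataFrame(data=[[2, 20, 'Sydney'],
                             (1, 16, 'Perth')])

# pandas falls back to numbering the columns from 0
print(unnamed.columns.tolist())
print(unnamed)
```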

We can also add completely new lists to our existing DataFrame, and pandas will match the index of each element in our new list with the index of each element in our DataFrame. Let's see this in action.

In [109]:
new_pizza_code = list(range(20, 40, 2))
new_pizza_code

[20, 22, 24, 26, 28, 30, 32, 34, 36, 38]

In [110]:
df_one['new_pizza_code'] = new_pizza_code
df_one

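|  | pizzas | price_pizza | pizzeria_location | new_pizza_code |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| 0 | 2 | 20 | Sydney | 20 |
| 1 | 1 | 16 | Sydney | 22 |
| 2 | 6 | 18 | Seville | 24 |
| 3 | 5 | 21 | Perth | 26 |
| 4 | 1 | 22 | Perth | 28 |
| 5 | 4 | 27 | Melbourne | 30 |
| 6 | 2 | 30 | Sydney | 32 |
| 7 | 6 | 21 | Seville | 34 |
| 8 | 2 | 22 | Melbourne | 36 |
| 9 | 1 | 17 | Perth | 38 |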


If the length of a list does not match that of our DataFrame, pandas will throw an error at us for the mismatched length. A list with exactly one element per row, like the one below, is added without any problem.

In [119]:
another_list = list(range(40, 60, 2))
df_one['another_list'] = another_list
df_one

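|  | pizzas | price_pizza | pizzeria_location | new_pizza_code | another_list |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 0 | 2 | 20 | Sydney | 20 | 40 |
| 1 | 1 | 16 | Sydney | 22 | 42 |
| 2 | 6 | 18 | Seville | 24 | 44 |
| 3 | 5 | 21 | Perth | 26 | 46 |
| 4 | 1 | 22 | Perth | 28 | 48 |
| 5 | 4 | 27 | Melbourne | 30 | 50 |
| 6 | 2 | 30 | Sydney | 32 | 52 |
| 7 | 6 | 21 | Seville | 34 | 54 |
| 8 | 2 | 22 | Melbourne | 36 | 56 |
| 9 | 1 | 17 | Perth | 38 | 58 |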
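The mismatched case can be sketched as follows, with a small made-up DataFrame and a list that is deliberately too short. The error is caught with `try`/`except` so that the cell keeps running and we can read the message.

```python
import pandas as pd

# small made-up DataFrame with three rows
df = pd.DataFrame({'pizzas': [2, 1, 6]})

# this list has only two elements, one fewer than the DataFrame has rows
too_short = [40, 42]

try:
    df['too_short'] = too_short
    error_message = None
except ValueError as error:
    # pandas refuses the assignment because the lengths do not match
    error_message = str(error)

print(error_message)
```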


Now let's see how to use NumPy arrays and matrices to create a DataFrame. Let's begin with a matrix.

In [112]:
# We first create our la_pizza numpy matrix

la_pizza_np = np.array([[2, 20, 'Sydney'],
                        [1, 16, 'Sydney'],
                        [6, 18, 'Seville'],
                        [5, 21, 'Perth'],
                        [1, 22, 'Perth'],
                        [4, 27, 'Melbourne'],
                        [2, 30, 'Sydney'],
                        [6, 21, 'Seville'],
                        [2, 22, 'Melbourne'],
                        [1, 17, 'Perth']])

la_pizza_np

array([['2', '20', 'Sydney'],
       ['1', '16', 'Sydney'],
       ['6', '18', 'Seville'],
       ['5', '21', 'Perth'],
       ['1', '22', 'Perth'],
       ['4', '27', 'Melbourne'],
       ['2', '30', 'Sydney'],
       ['6', '21', 'Seville'],
       ['2', '22', 'Melbourne'],
       ['1', '17', 'Perth']], dtype='<U11')

In [113]:
# then we pass the matrix into the pd.DataFrame method and provide a list of names for the columns

df_np_pizza = pd.DataFrame(la_pizza_np, columns=['pizzas', 'price_pizza', 'pizzeria_location'])
df_np_pizza

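|  | pizzas | price_pizza | pizzeria_location |
|:--------:|:--------:|:--------:|:--------:|
| 0 | 2 | 20 | Sydney |
| 1 | 1 | 16 | Sydney |
| 2 | 6 | 18 | Seville |
| 3 | 5 | 21 | Perth |
| 4 | 1 | 22 | Perth |
| 5 | 4 | 27 | Melbourne |
| 6 | 2 | 30 | Sydney |
| 7 | 6 | 21 | Seville |
| 8 | 2 | 22 | Melbourne |
| 9 | 1 | 17 | Perth |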


Another cool trait inherited from NumPy is that we can use the descriptive attributes we learned about in the previous lesson: `.dtypes` (the DataFrame counterpart of NumPy's `.dtype`), `.shape`, and `.ndim`.

In [120]:
# notice the shape of our new dataframe

df_np_pizza.shape

(10, 3)

In [121]:
# check the types
df_np_pizza.dtypes

pizzas               object
price_pizza          object
pizzeria_location    object
dtype: object

In [127]:
# astype returns a converted copy; df_np_pizza itself is left unchanged
df_np_pizza.astype({'pizzas':int, 'price_pizza':float})

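|  | pizzas | price_pizza | pizzeria_location |
|:--------:|:--------:|:--------:|:--------:|
| 0 | 2 | 20.0 | Sydney |
| 1 | 1 | 16.0 | Sydney |
| 2 | 6 | 18.0 | Seville |
| 3 | 5 | 21.0 | Perth |
| 4 | 1 | 22.0 | Perth |
| 5 | 4 | 27.0 | Melbourne |
| 6 | 2 | 30.0 | Sydney |
| 7 | 6 | 21.0 | Seville |
| 8 | 2 | 22.0 | Melbourne |
| 9 | 1 | 17.0 | Perth |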


In [128]:
# the types are still object because the result of astype was not assigned back
df_np_pizza.dtypes

pizzas               object
price_pizza          object
pizzeria_location    object
dtype: object
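To actually keep the converted types, the result of `.astype()` has to be assigned back to a variable. Here is a minimal sketch with a small made-up DataFrame whose numbers are stored as strings, as happened with the NumPy matrix above.

```python
import pandas as pd

# made-up data where the numbers arrived as strings
df = pd.DataFrame({'pizzas': ['2', '1', '6'],
                   'price_pizza': ['20', '16', '18']})

# assign the converted copy back to the same variable to keep the new types
df = df.astype({'pizzas': int, 'price_pizza': float})

print(df.dtypes)
print(df['pizzas'].sum())  # arithmetic now works on numbers, not on text
```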

In [129]:
# check the dimension
df_np_pizza.ndim

2

It is important to note that creating a matrix in which each row array holds one value for each of the three columns is not the same as creating three arrays that end up as 3 rows and 10 columns. Our intuition might betray us in this instance.

Let's look at an example with fake weather data where we pass in three arrays to a NumPy array that should represent the same DataFrame as the one above, but with different data now, of course.

In [130]:
weather_np = np.random.randint(10, 45, 10)
weather_np

array([41, 26, 16, 20, 17, 24, 40, 28, 16, 11])

In [131]:
cities = ['Sydney', 'Sydney', 'Seville', 'Perth', 'Perth', 
          'Melbourne', 'Sydney', 'Seville', 'Melbourne', 'Perth']
cities

['Sydney',
 'Sydney',
 'Seville',
 'Perth',
 'Perth',
 'Melbourne',
 'Sydney',
 'Seville',
 'Melbourne',
 'Perth']

In [132]:
days = np.random.randint(10, 30, 10)
days

array([26, 23, 28, 24, 11, 27, 22, 10, 14, 19])

In [133]:
data_weather = np.array([weather_np,
                         cities,
                         days])
data_weather

array([['41', '26', '16', '20', '17', '24', '40', '28', '16', '11'],
       ['Sydney', 'Sydney', 'Seville', 'Perth', 'Perth', 'Melbourne',
        'Sydney', 'Seville', 'Melbourne', 'Perth'],
       ['26', '23', '28', '24', '11', '27', '22', '10', '14', '19']],
      dtype='<U11')

Notice the shape of our new matrix. What do you think will happen when we pass it through our DataFrame constructor?

In [134]:
pd.DataFrame(data=data_weather, columns=['weather', 'cities', 'days'])

ValueError: Shape of passed values is (3, 10), indices imply (3, 3)

The tricky part of using NumPy arrays is that each array is interpreted as a horizontal row, meaning we would have 10 columns and 3 rows if we were to use our array with its current shape. You probably already noticed this by running the code above.

The solution is to transpose our matrix, turning the columns into rows and the rows into columns. NumPy provides a very nice way of doing this: by adding the `.T` attribute at the end of any array or matrix you can transpose it, which here changes its shape from (3, 10) to (10, 3).

Let's see what this looks like and then use it to create our new DataFrame.

In [135]:
# same list as before with the pizzas😎

data_weather.T

array([['41', 'Sydney', '26'],
       ['26', 'Sydney', '23'],
       ['16', 'Seville', '28'],
       ['20', 'Perth', '24'],
       ['17', 'Perth', '11'],
       ['24', 'Melbourne', '27'],
       ['40', 'Sydney', '22'],
       ['28', 'Seville', '10'],
       ['16', 'Melbourne', '14'],
       ['11', 'Perth', '19']], dtype='<U11')

In [136]:
pd.DataFrame(data=data_weather.T, columns=['weather', 'cities', 'days'])

Unnamed: 0,weather,cities,days
0,41,Sydney,26
1,26,Sydney,23
2,16,Seville,28
3,20,Perth,24
4,17,Perth,11
5,24,Melbourne,27
6,40,Sydney,22
7,28,Seville,10
8,16,Melbourne,14
9,11,Perth,19
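
One thing worth noticing in the DataFrame above: a NumPy array holds a single data type, so mixing numbers and city names turned every value into a string (see the `dtype='<U11'` in the earlier output). Below is a minimal sketch of two ways around this, assuming we can either convert the columns afterwards with `.astype()` or build the DataFrame from a dictionary of columns. The sample values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical small sample, not the random data generated above
as_strings = pd.DataFrame(
    np.array([['41', 'Sydney', '26'],
              ['26', 'Seville', '23'],
              ['16', 'Perth', '28']]),
    columns=['weather', 'cities', 'days']
)

# option 1: convert the numeric columns back to integers
fixed = as_strings.astype({'weather': int, 'days': int})

# option 2: a dictionary of columns keeps each column's own data type
from_dict = pd.DataFrame({'weather': np.array([41, 26, 16]),
                          'cities': ['Sydney', 'Seville', 'Perth'],
                          'days': np.array([26, 23, 28])})

print(fixed.dtypes)
print(from_dict.dtypes)
```

With either option, computations such as `fixed['weather'] + fixed['days']` add numbers instead of gluing strings together.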


Lastly, imagine we had several pandas Series representing different values but with similar indexes. If we wanted to combine all of them into a single DataFrame, we could do so with `pd.concat([Series1, Series2, Series3])` or with `pd.DataFrame(data=dictionary)`. In the dictionary, the keys become the column names of the DataFrame and the values are the pandas Series that supply the elements of each column.

One important thing to keep in mind is that, just like with the `np.concatenate` function we saw in the last lesson, you will need to pick an axis when using `pd.concat`.

**Note:** pandas will try to match the indexes of your multiple Series when combining their elements, but, if the indexes do not match, it will add an `np.nan` (Not a Number) at that place to show that a particular element does not exist.
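
To make the choice of axis concrete, here is a small sketch with two made-up Series (the names `s1` and `s2` are only for illustration). With `axis=0` the Series are stacked on top of each other into one longer Series, and with `axis=1` they are placed side by side as the columns of a DataFrame.

```python
import pandas as pd

# two small hypothetical Series sharing the same default index
s1 = pd.Series([1, 2, 3], name='first')
s2 = pd.Series([10, 20, 30], name='second')

stacked = pd.concat([s1, s2], axis=0)       # one long Series with 6 elements
side_by_side = pd.concat([s1, s2], axis=1)  # DataFrame with 3 rows and 2 columns

print(stacked.shape)       # (6,)
print(side_by_side.shape)  # (3, 2)
```

Since we want a DataFrame with one column per Series, the examples that follow use `axis=1`.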

In [137]:
# let's start with two series
series_one = pd.Series(np.random.randint(0, 20, 20), name='random_nums')
series_two = pd.Series(list(range(20, 60, 2)), name="two_steps")
print(series_one, '\n', series_two)

0     16
1      4
2      8
3     13
4      2
5      9
6     17
7     13
8     15
9     10
10     2
11    15
12    11
13     7
14    17
15     5
16    18
17    13
18     0
19     3
Name: random_nums, dtype: int32 
 0     20
1     22
2     24
3     26
4     28
5     30
6     32
7     34
8     36
9     38
10    40
11    42
12    44
13    46
14    48
15    50
16    52
17    54
18    56
19    58
Name: two_steps, dtype: int64


In [138]:
df_of_series = pd.concat([series_one, series_two], axis=1)
df_of_series

Unnamed: 0,random_nums,two_steps
0,16,20
1,4,22
2,8,24
3,13,26
4,2,28
5,9,30
6,17,32
7,13,34
8,15,36
9,10,38


As noted above, the concatenation happens at the index level and both columns have been merged into a single dataframe. Let's now see what happens with different indexes that are not numerical.

In [139]:
# let's start with two series
series_three = pd.Series(np.linspace(0, 3, 10), index=list('abcdefghij'), name='random_nums')
series_four = pd.Series(list(range(50, 70, 2)), index=list('abcdefghij'), name="two_steps")
print(series_three, '\n', series_four)

a    0.000000
b    0.333333
c    0.666667
d    1.000000
e    1.333333
f    1.666667
g    2.000000
h    2.333333
i    2.666667
j    3.000000
Name: random_nums, dtype: float64 
 a    50
b    52
c    54
d    56
e    58
f    60
g    62
h    64
i    66
j    68
Name: two_steps, dtype: int64


In [140]:
df_of_series = pd.concat([series_three, series_four], axis=1)
df_of_series

Unnamed: 0,random_nums,two_steps
a,0.0,50
b,0.333333,52
c,0.666667,54
d,1.0,56
e,1.333333,58
f,1.666667,60
g,2.0,62
h,2.333333,64
i,2.666667,66
j,3.0,68


We are still able to concatenate at the index level on matching letters, which is what we'd like. Let's now examine what happens if we don't have the same number of elements in both of the Series we are trying to concatenate.

In [141]:
# let's start with two series
series_five = pd.Series(np.linspace(0, 3, 8), index=list('abcdehij'), name='random_nums')
series_six = pd.Series(list(range(50, 70, 2)), index=list('abcdefghij'), name="two_steps")
print(series_five, '\n', series_six)

a    0.000000
b    0.428571
c    0.857143
d    1.285714
e    1.714286
h    2.142857
i    2.571429
j    3.000000
Name: random_nums, dtype: float64 
 a    50
b    52
c    54
d    56
e    58
f    60
g    62
h    64
i    66
j    68
Name: two_steps, dtype: int64


In [142]:
df_of_series = pd.concat([series_five, series_six], axis=1)
df_of_series

Unnamed: 0,random_nums,two_steps
a,0.0,50
b,0.428571,52
c,0.857143,54
d,1.285714,56
e,1.714286,58
h,2.142857,64
i,2.571429,66
j,3.0,68
f,,60
g,,62


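We get what is called a NaN value, which stands for `Not a Number`. Notice that `f` and `g`, which only exist in the second Series, have no value in the `random_nums` column. NaN is a special value assigned to missing elements, and we will learn how to deal with it very soon in the cleaning notebook.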

Let's now examine how to create a pandas dataframe from a dictionary of Series.

In [143]:
# same approach as above but with a dictionary of Series

dict_of_series = {
    'random_nums': pd.Series(np.random.randint(0, 20, 20), name='random_nums'),
    'two_steps': pd.Series(list(range(20, 60, 2)), name="two_steps")
}
dict_of_series

{'random_nums': 0     10
 1     17
 2     11
 3      2
 4     12
 5     18
 6      1
 7      1
 8     15
 9     17
 10     1
 11    17
 12     2
 13    16
 14    12
 15     6
 16    16
 17     0
 18     8
 19     3
 Name: random_nums, dtype: int32,
 'two_steps': 0     20
 1     22
 2     24
 3     26
 4     28
 5     30
 6     32
 7     34
 8     36
 9     38
 10    40
 11    42
 12    44
 13    46
 14    48
 15    50
 16    52
 17    54
 18    56
 19    58
 Name: two_steps, dtype: int64}

In [None]:
dict_of_series['random_nums']

In [144]:
# we can use a regular dataframe call for this one
df_dict_series = pd.DataFrame(dict_of_series)
df_dict_series

Unnamed: 0,random_nums,two_steps
0,10,20
1,17,22
2,11,24
3,2,26
4,12,28
5,18,30
6,1,32
7,1,34
8,15,36
9,17,38


# Exercise 4

1. Create two pandas Series with ones and zeros, respectively.
2. Add 5 to the one with zeros and assign the result to a new variable.
3. Add 8 to the one with ones and assign the result to a new variable.
4. Add the two previous Series together and assign the result to a variable.
5. Create two pandas Series with 5 random numbers each, using 3 countries as the index of each. Only two countries should match between the indexes of the two Series.

In [9]:
zeros_panda = pd.Series(np.zeros(10, dtype='int8'))
zeros_panda

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int8

In [11]:
zero_plus5_panda = zeros_panda + 5
print(zero_plus5_panda)

0    5
1    5
2    5
3    5
4    5
5    5
6    5
7    5
8    5
9    5
dtype: int8


In [12]:
ones_panda = pd.Series(np.ones((10), dtype='int8'))
one_plus8_panda = ones_panda + 8
print(one_plus8_panda)


0    9
1    9
2    9
3    9
4    9
5    9
6    9
7    9
8    9
9    9
dtype: int8


In [15]:
countries_a = pd.Series(np.random.randint(50, 100, 5), index=['DR', 'India', 'Denmark', 'DR', 'India'])
countries_a


#random_one = pd.Series(data=np.random.randint(0,10,5, dtype='int8'),
#index = ['Australia', 'Denmark', 'Japan', 'Denmark', 'Japan'], name='random_data')
#random_one
#just need to double this - answer from Malan

DR         78
India      93
Denmark    60
DR         87
India      75
dtype: int32

# Exercise 5

1. Create a DataFrame using a list of lists.
2. Create a DataFrame using a NumPy matrix with different data types in each column.
3. Perform a computation with two columns and add the result as a new column in the DataFrame. (e.g. add or subtract two columns to create a third one.)

In [29]:
# initialize list of lists
data = [['Julian', 37], ['Luke', 33], ['Matt', 35]]
data

[['Julian', 37], ['Luke', 33], ['Matt', 35]]

In [30]:
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df )

     Name  Age
0  Julian   37
1    Luke   33
2    Matt   35


In [31]:
# populate the DataFrame with data
dtfr = pd.DataFrame(np.array([[1, 2, 3, 4],
                              ['Homebush', 'Summer Hill', 'Croydon', 'Pyrmont'],
                              [11.34, 14.76, 19.74, 7.56]])).T
dtfr.columns=['Order', 'Suburb', 'Density']
dtfr

Unnamed: 0,Order,Suburb,Density
0,1,Homebush,11.34
1,2,Summer Hill,14.76
2,3,Croydon,19.74
3,4,Pyrmont,7.56


In [45]:
dtfr = dtfr.astype({'Order':int, 'Density': float})

In [47]:
dtfr['Sum Column'] = dtfr['Order'] + dtfr['Density']
dtfr

Unnamed: 0,Order,Suburb,Density,Sum Column
0,1,Homebush,11.34,12.34
1,2,Summer Hill,14.76,16.76
2,3,Croydon,19.74,22.74
3,4,Pyrmont,7.56,11.56


In [48]:
dtfr


Unnamed: 0,Order,Suburb,Density,Sum Column
0,1,Homebush,11.34,12.34
1,2,Summer Hill,14.76,16.76
2,3,Croydon,19.74,22.74
3,4,Pyrmont,7.56,11.56


# Exercise 6

1. Create 3 Series with 15 linearly spaced numbers in each, and an index based on letters for each.
2. Concatenate the 3 Series you just created into a DataFrame.
3. Sum an entire column and assign the result to a variable called `one_num`.
4. Get the average of an entire column and assign the result to a variable called `one_mean`.

In [33]:
series1 = pd.Series(data=np.linspace(3, 60, 15), index=list('yayitissaturday'))
# Next 2 from Franco in class
series2 = pd.Series(data=np.linspace(16,226,15), index = list('yayitissaturday'))
series3 = pd.Series(data=np.linspace(-19,3,15), index = list('yayitissaturday'))
print(series1)

y     3.000000
a     7.071429
y    11.142857
i    15.214286
t    19.285714
i    23.357143
s    27.428571
s    31.500000
a    35.571429
t    39.642857
u    43.714286
r    47.785714
d    51.857143
a    55.928571
y    60.000000
dtype: float64


In [34]:
print(series2)

y     16.0
a     31.0
y     46.0
i     61.0
t     76.0
i     91.0
s    106.0
s    121.0
a    136.0
t    151.0
u    166.0
r    181.0
d    196.0
a    211.0
y    226.0
dtype: float64


In [35]:
print(series3)

y   -19.000000
a   -17.428571
y   -15.857143
i   -14.285714
t   -12.714286
i   -11.142857
s    -9.571429
s    -8.000000
a    -6.428571
t    -4.857143
u    -3.285714
r    -1.714286
d    -0.142857
a     1.428571
y     3.000000
dtype: float64


In [36]:
# Did not work with a different index in series 1
seriesconcat = pd.concat([series1, series2, series3], axis=1)
seriesconcat

Unnamed: 0,0,1,2
y,3.0,16.0,-19.0
a,7.071429,31.0,-17.428571
y,11.142857,46.0,-15.857143
i,15.214286,61.0,-14.285714
t,19.285714,76.0,-12.714286
i,23.357143,91.0,-11.142857
s,27.428571,106.0,-9.571429
s,31.5,121.0,-8.0
a,35.571429,136.0,-6.428571
t,39.642857,151.0,-4.857143


In [37]:
one_num = seriesconcat.sum(0)
one_num

0     472.5
1    1815.0
2    -120.0
dtype: float64

In [38]:
one_mean = seriesconcat.mean(0)
one_mean

0     31.5
1    121.0
2     -8.0
dtype: float64

In [39]:
#Exercise 6 Ramon's version
series1 = pd.Series(data=np.linspace(14,20,15), index = list('yayitissaturday'))
series2 = pd.Series(data=np.linspace(16,226,15), index = list('yayitissaturday'))
series3 = pd.Series(data=np.linspace(-19,3,15), index = list('yayitissaturday'))
print(series1)

y    14.000000
a    14.428571
y    14.857143
i    15.285714
t    15.714286
i    16.142857
s    16.571429
s    17.000000
a    17.428571
t    17.857143
u    18.285714
r    18.714286
d    19.142857
a    19.571429
y    20.000000
dtype: float64


In [40]:
print(series2)

y     16.0
a     31.0
y     46.0
i     61.0
t     76.0
i     91.0
s    106.0
s    121.0
a    136.0
t    151.0
u    166.0
r    181.0
d    196.0
a    211.0
y    226.0
dtype: float64


In [41]:
print(series3)

y   -19.000000
a   -17.428571
y   -15.857143
i   -14.285714
t   -12.714286
i   -11.142857
s    -9.571429
s    -8.000000
a    -6.428571
t    -4.857143
u    -3.285714
r    -1.714286
d    -0.142857
a     1.428571
y     3.000000
dtype: float64


In [42]:
seriesconcat = pd.concat([series1, series2, series3], axis=1)
seriesconcat

Unnamed: 0,0,1,2
y,14.0,16.0,-19.0
a,14.428571,31.0,-17.428571
y,14.857143,46.0,-15.857143
i,15.285714,61.0,-14.285714
t,15.714286,76.0,-12.714286
i,16.142857,91.0,-11.142857
s,16.571429,106.0,-9.571429
s,17.0,121.0,-8.0
a,17.428571,136.0,-6.428571
t,17.857143,151.0,-4.857143


In [43]:
one_num = seriesconcat[0].sum(0)
one_num

255.0

In [44]:
one_mean = seriesconcat[2].mean(0)
one_mean

-8.000000000000002

# 2.3 Accessing and Selecting Data

To access and select data in a pandas DataFrame we can use the same tools we learned in lesson 2: slicing with `array[start:stop:step, start:stop:step]` for rows and columns, fancy indexing, and masking for n-dimensional arrays. pandas also provides us with two additional tools for accessing data inside a DataFrame: `df.loc[]` and `df.iloc[]`.

- `df.loc[]` helps us select data in the same manner as with NumPy arrays, except that we select rows and columns by their labels (e.g. the column names) and not by their positions.
- `df.iloc[]` allows us to select rows and columns by their integer positions. For example, if I have the columns `[weather, cities, days]`, I could select weather with index 0, cities with index 1, and days with index 2, the same way we would do it with NumPy. One caveat of this method is that regardless of where the index starts (e.g. 10 to 20), or how it is represented (e.g. a, b, c, d), it will start counting from 0.
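
The difference between labels and positions is easiest to see when the index does not start at 0. Here is a minimal sketch, assuming a small made-up DataFrame (`df_small`) whose index runs from 10 to 12:

```python
import pandas as pd

# hypothetical DataFrame whose index starts at 10 instead of 0
df_small = pd.DataFrame({'weather': [41, 26, 16],
                         'cities': ['Sydney', 'Seville', 'Perth']},
                        index=[10, 11, 12])

by_label = df_small.loc[10, 'cities']   # row labeled 10, column named 'cities'
by_position = df_small.iloc[0, 1]       # first row, second column

print(by_label, by_position)
```

Both selections point at the same cell, the first one through its labels and the second one through its position.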

Let's look at the regular way first.

In [None]:
# our previous weather dataframe
df_weather = pd.DataFrame(data=data_weather.T, columns=['weather', 'cities', 'days'])
df_weather

In [None]:
# regular indexing
df_weather[3:5]

In [None]:
# masking
df_weather[df_weather['cities'] == 'Perth']

In [None]:
# more masking or fancy indexing
df_weather.loc[df_weather['cities'] == 'Sydney']

In [None]:
df_weather.loc[df_weather['cities'] == 'Sydney', 'days']

This is a great, quick and dirty approach, but if we wanted to get more granular with how we select our data, we would have to resort to the additional functionality of `.iloc[]` and `.loc[]`, namely, selecting exactly which rows and columns of our dataset we want and how we want them.

It is important to note that `.loc[]` includes the end point of a slice but `.iloc[]` does not. Meaning, `df.loc[:10]` would actually select the row labeled 10 as well. The same applies to the columns.
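
A quick way to convince ourselves of this is to count the rows each slice returns. The sketch below uses a made-up DataFrame (`df_ten`) with the default index from 0 to 9:

```python
import pandas as pd

# hypothetical DataFrame with the default 0..9 index
df_ten = pd.DataFrame({'days': range(10, 20)})

with_loc = df_ten.loc[0:5, 'days']   # labels 0 through 5, end point included
with_iloc = df_ten.iloc[0:5, 0]      # positions 0 through 4, end point excluded

print(len(with_loc), len(with_iloc))  # 6 5
```

The same slice `0:5` gives 6 rows with `.loc[]` and 5 rows with `.iloc[]`.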

Let's look at pandas methods for slicing and dicing.

In [None]:
# select rows 0 through 7 of the days column (8 rows, since .loc includes the end point)

df_weather.loc[0:7, 'days']

In [None]:
# select rows 0 through 5 of the weather and days columns (6 rows, since .loc includes the end point)

df_weather.loc[0:5, ['weather', 'days']]

In [None]:
# similar to before but with iloc and integers; the end point is excluded, so we get 7 rows

df_weather.iloc[0:7, 2]

In [None]:
df_weather.iloc[0:5, [0, 2]]

In [None]:
df_weather.iloc[0:-2, 1:]

In [None]:
# multiple steps iloc
df_weather.iloc[::2, 1:]

In [None]:
# multiple steps loc
df_weather.loc[[3, 5, 7], ::2]

Both of these methods, `.iloc[]` and `.loc[]`, will become extremely useful as we move along the course and our data analytics journey. A good tip for remembering the difference between the two is to always think of integers when you see the i in `.iloc[]`.

# Exercise 7

1. Create a NumPy array with 100 random integers. Assign it to the variable `one_hundred_nums`.
2. Reshape `one_hundred_nums` into a 10 by 10 matrix. Assign it to the variable `matrix`.
3. With `matrix`, create a pandas DataFrame and call it `df`. Make the columns different upper-case letters.
4. Using `.iloc[]`, select the first 5 rows and the last 5 columns. Assign the result to the variable `islice`.
5. Using `.loc[]`, select every other row starting from the third one, and every other column starting from the second one. Assign the result to the variable `dos_slice`.

# Awesome Work - Head to Notebook 06

![great work](https://media.giphy.com/media/xTQXENNW77nUsNVmV4/giphy.gif)