# Introduction 

Throughout this entire notebook you should be experimenting with the code in the non-text cells. A great way to begin to get a feel for Python is by playing with it. So have some fun by changing the values in the cells and then running them again with Shift-Enter. Before you do, think about what you expect the output to be, and make sure your intuition matches up with what you run. If it doesn't, take some time to think about what happened so you can hone your intuition.

At the end of each section there will be some questions to help further your understanding. Remember, in Python we can always manually test code by running it; however, you should try to think about the answers to these questions before you run some code. This way you can check and verify your understanding of the section's topic.


## Grabbing your data - Part 1

#### The Basics

We now know how to look at our data. What if we want to grab certain parts of it to inspect or transform? Say we want an entire row or an entire column: how do we do that? Let's dive in, starting with some indexing.

The format we use to index into our dataframe depends on exactly which subset of the data we want to grab. If we want entire rows or columns, we can use bracket notation (just like we use bracket notation to index into lists). If we want an entire column, we place the **column name** in the brackets (for multiple columns, a list of names inside those brackets). We can also sometimes access a column via dot notation on the dataframe, which we'll show in a second. If we want entire rows, we have to place a **slice** inside the brackets, such as `[:3]` or `[2:5]`; placing a single integer in the brackets won't work, because pandas looks it up as a column label.

In [1]:
import pandas as pd
df = pd.read_csv('../data/winequality-red.csv', delimiter=';')

In [2]:
# Let's take a quick look at the DataFrame that we're using to remind ourselves what it 
# looks like. 
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [3]:
df['chlorides'] # Grabs the 'chlorides' column. 
df.chlorides # Also grabs the 'chlorides' column. 

0       0.076
1       0.098
2       0.092
3       0.075
4       0.076
5       0.075
6       0.069
7       0.065
8       0.073
9       0.071
10      0.097
11      0.071
12      0.089
13      0.114
14      0.176
15      0.170
16      0.092
17      0.368
18      0.086
19      0.341
20      0.077
21      0.082
22      0.106
23      0.084
24      0.085
25      0.080
26      0.080
27      0.106
28      0.080
29      0.082
        ...  
1569    0.056
1570    0.230
1571    0.038
1572    0.069
1573    0.075
1574    0.074
1575    0.060
1576    0.081
1577    0.076
1578    0.118
1579    0.053
1580    0.068
1581    0.053
1582    0.053
1583    0.074
1584    0.061
1585    0.066
1586    0.065
1587    0.066
1588    0.068
1589    0.073
1590    0.077
1591    0.089
1592    0.076
1593    0.068
1594    0.090
1595    0.062
1596    0.076
1597    0.075
1598    0.067
Name: chlorides, dtype: float64

In [4]:
df['volatile acidity']
df.volatile acidity # Dot notation only works if the column name has no spaces. 

SyntaxError: invalid syntax (<ipython-input-4-a737249bdd33>, line 2)

We can, however, alter the column headers to remove the spaces, at which point dot notation would work.  
The following code shows how we can quickly and efficiently eliminate spaces from the column names using a list comprehension:

In [5]:
df2 = df.copy()
cols = df2.columns.tolist()
cols = [col.replace(' ', '_') for col in cols]
df2.columns = cols
df2.volatile_acidity

0       0.700
1       0.880
2       0.760
3       0.280
4       0.700
5       0.660
6       0.600
7       0.650
8       0.580
9       0.500
10      0.580
11      0.500
12      0.615
13      0.610
14      0.620
15      0.620
16      0.280
17      0.560
18      0.590
19      0.320
20      0.220
21      0.390
22      0.430
23      0.490
24      0.400
25      0.390
26      0.410
27      0.430
28      0.710
29      0.645
        ...  
1569    0.510
1570    0.360
1571    0.380
1572    0.690
1573    0.580
1574    0.310
1575    0.520
1576    0.300
1577    0.700
1578    0.670
1579    0.560
1580    0.350
1581    0.560
1582    0.715
1583    0.460
1584    0.320
1585    0.390
1586    0.310
1587    0.610
1588    0.660
1589    0.725
1590    0.550
1591    0.740
1592    0.510
1593    0.620
1594    0.600
1595    0.550
1596    0.510
1597    0.645
1598    0.310
Name: volatile_acidity, dtype: float64
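
As an aside, the same renaming can be sketched without a list comprehension by using the string methods that pandas exposes on the column index. The small `demo` frame below is a stand-in built from the first two rows shown in `df.head()`, and the names `demo` and `renamed` are ours, not part of the notebook's data.

```python
import pandas as pd

# Stand-in frame (hypothetical name) using the first two rows from df.head().
demo = pd.DataFrame({
    'fixed acidity': [7.4, 7.8],
    'volatile acidity': [0.70, 0.88],
    'pH': [3.51, 3.20],
})

# Work on a copy so the original column names stay untouched.
renamed = demo.copy()

# .str.replace runs on every column label at once and returns a new Index.
# regex=False treats the space as a literal character.
renamed.columns = renamed.columns.str.replace(' ', '_', regex=False)

print(renamed.columns.tolist())
print(renamed.volatile_acidity.tolist())  # dot notation works after renaming
```

Either approach gives the same column names; the list comprehension is plain Python, while this version leans on pandas itself.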

In [6]:
# We can access several columns at once by passing in a list of column names. 
df[['chlorides', 'volatile acidity']]

| | chlorides | volatile acidity |
|---|---|---|
| 0 | 0.076 | 0.700 |
| 1 | 0.098 | 0.880 |
| 2 | 0.092 | 0.760 |
| 3 | 0.075 | 0.280 |
| 4 | 0.076 | 0.700 |
| 5 | 0.075 | 0.660 |
| 6 | 0.069 | 0.600 |
| 7 | 0.065 | 0.650 |
| 8 | 0.073 | 0.580 |
| 9 | 0.071 | 0.500 |


In [7]:
df[:3] # This will grab from the beginning up to but not including the row at index 3. 

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |


In [8]:
# This will grab up to but not including the row at index 1 (i.e. it'll grab the row at index 0). 
df[:1]

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [9]:
# This will not work: a single integer in plain brackets is looked up as a column
# label, and there is no column named 0. To get rows we need a slice, like df[0:1].
df[0]

KeyError: 0
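
To see why the single index fails while a slice works, here is a minimal sketch on a tiny stand-in frame. The name `demo` and its values (taken from the `pH` and `quality` columns of the first three rows above) are only for illustration.

```python
import pandas as pd

# Stand-in frame (hypothetical name) with two columns and a default 0..2 index.
demo = pd.DataFrame({'pH': [3.51, 3.20, 3.26], 'quality': [5, 5, 5]})

# A bare integer in plain brackets is treated as a column label,
# and this frame has no column labelled 0, so pandas raises KeyError.
try:
    demo[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A slice in plain brackets selects rows and returns a DataFrame.
first_row_frame = demo[0:1]

# .iloc with a single position returns that row as a Series.
first_row_series = demo.iloc[0]

print(lookup_failed)
print(first_row_frame.shape)
print(first_row_series['pH'])
```

Note the difference in what comes back: the slice gives us a one-row `DataFrame`, while `.iloc[0]` gives us a `Series` indexed by the column names.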

In [10]:
# This won't work because we are trying to access a subset of rows 
# **and** columns at the same time. 
df[:1, 'volatile acidity'] 

TypeError: unhashable type

##### Data Grabbing Questions Part 1

1. How would we grab the `density` column from the `DataFrame` above?
2. How would we grab both the `density` and `sulphates` columns from the `DataFrame` above?
3. How would we grab row `252` from the `DataFrame` above?
4. How would we grab rows `252-454` from the `DataFrame` above?

## Grabbing your data - Part 2

What if we want to grab certain rows **and** certain columns, rather than just entire rows or entire columns?

If we want to grab only certain rows and columns, there are three **indexers** that we can use to index into a Pandas DataFrame: `loc[]`, `iloc[]`, and `ix[]`. Note that these are accessed as attributes via dot notation on our `DataFrame` object and are then indexed with square brackets, rather than being called with parentheses like methods. The difference between the three has to do with how we use them. `loc[]` is a purely label-location based indexer, `iloc[]` is a purely integer-location based indexer, and `ix[]` is a primarily label-location based indexer that falls back to integer indexing.

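Because of the strict restrictions on the use of `loc[]` and `iloc[]`, I almost always use `ix[]`, which is much more flexible. Be aware, though, that `ix[]` was deprecated in pandas 0.20 and removed in pandas 1.0, so on current versions you have to choose between `loc[]` and `iloc[]`.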
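
The label/position distinction is easiest to see when the index labels are not simply `0, 1, 2, ...`. Here is a minimal sketch, assuming a made-up frame `demo` whose index labels are `10, 20, 30` (the labels are hypothetical; the values are borrowed from the first rows of our data).

```python
import pandas as pd

# Hypothetical frame whose index labels deliberately differ from row positions.
demo = pd.DataFrame(
    {'pH': [3.51, 3.20, 3.26], 'alcohol': [9.4, 9.8, 9.8]},
    index=[10, 20, 30],
)

# .loc uses labels: the row labelled 20 and the column labelled 'pH'.
by_label = demo.loc[20, 'pH']

# .iloc uses positions: the second row (position 1) and first column (position 0).
by_position = demo.iloc[1, 0]

print(by_label, by_position)
```

Both lines reach the same cell, one by name and one by position. In our wine data the index labels happen to equal the row positions, which is why the two can look interchangeable on the rows.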

In [11]:
# Let's look at our data real quickly again. 
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [12]:
# Loc is label based. All of these will work, because they are recognized as labels on the 
# rows (index labels) or columns (column name labels). 
df.loc[0, 'fixed acidity'] # 0 is one of the index labels, and 'fixed acidity' is a column label.

7.4000000000000004

In [13]:
# Slices of index labels work too. Unlike a regular Python slice, a .loc slice
# includes its end label (note the row labelled 10 in the output below).
df.loc[0:10, 'fixed acidity']

0      7.4
1      7.8
2      7.8
3     11.2
4      7.4
5      7.4
6      7.9
7      7.3
8      7.8
9      7.5
10     6.7
Name: fixed acidity, dtype: float64
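
Notice that the output above has eleven rows, not ten. A short sketch of how slice endpoints behave across the three styles we've seen so far, using a stand-in frame `demo` (our name) built from the first five `fixed acidity` values:

```python
import pandas as pd

# Stand-in frame (hypothetical name) with a default integer index 0..4.
demo = pd.DataFrame({'fixed acidity': [7.4, 7.8, 7.8, 11.2, 7.4]})

# .loc slices by label and INCLUDES the end label: rows 0, 1, 2 and 3.
label_slice = demo.loc[0:3, 'fixed acidity']

# .iloc slices by position and EXCLUDES the end position: rows 0, 1 and 2.
position_slice = demo.iloc[0:3, 0]

# Plain bracket slicing also excludes the end: rows 0, 1 and 2.
plain_slice = demo[0:3]

print(len(label_slice), len(position_slice), len(plain_slice))
```

This is worth keeping in mind when switching between `loc[]` and `iloc[]`, since the same-looking slice returns a different number of rows.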

In [14]:
df.loc[10:15, ['chlorides', 'fixed acidity']]

| | chlorides | fixed acidity |
|---|---|---|
| 10 | 0.097 | 6.7 |
| 11 | 0.071 | 7.5 |
| 12 | 0.089 | 5.6 |
| 13 | 0.114 | 7.8 |
| 14 | 0.176 | 8.9 |
| 15 | 0.17 | 8.9 |


In [15]:
# These will all fail, because they attempt to access the columns by position integers, 
# and loc only takes labels. 
df.loc[0, 0]
df.loc[0:10, 0]
df.loc[10:15, [0, 4]]

KeyError: 'the label [0] is not in the [index]'

In [16]:
# The above will all work with .iloc, though, since it takes integers (and not labels)
df.iloc[0, 0]
df.iloc[0:10, 0]
df.iloc[10:15, [0, 4]]

| | fixed acidity | chlorides |
|---|---|---|
| 10 | 6.7 | 0.097 |
| 11 | 7.5 | 0.071 |
| 12 | 5.6 | 0.089 |
| 13 | 7.8 | 0.114 |
| 14 | 8.9 | 0.176 |


In [17]:
# Using labels, though, like we did with .loc, will NOT work. These will all fail
df.iloc[0, 'fixed acidity']
df.iloc[0:10, 'fixed acidity'] 
df.iloc[10:15, ['chlorides', 'fixed acidity']]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

In [18]:
# Note that anything we have tried above will work with ix[]. It's because of this flexibility 
# that I almost always use ix[]. (ix[] was removed in pandas 1.0, so this cell needs an older version.)
df.ix[0, 'fixed acidity']
df.ix[0:10, 'fixed acidity'] 
df.ix[10:15, ['chlorides', 'fixed acidity']]
df.ix[0, 0]
df.ix[0:10, 0]
df.ix[10:15, [0, 4]]

| | fixed acidity | chlorides |
|---|---|---|
| 10 | 6.7 | 0.097 |
| 11 | 7.5 | 0.071 |
| 12 | 5.6 | 0.089 |
| 13 | 7.8 | 0.114 |
| 14 | 8.9 | 0.176 |
| 15 | 8.9 | 0.17 |
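
On pandas versions where `ix[]` is not available, one way to get the same mix of labels and positions is to convert one kind into the other before indexing. The sketch below assumes a stand-in frame `demo` (our name) holding three columns from rows `10` through `12` of our data, re-indexed from `0` for brevity.

```python
import pandas as pd

# Stand-in frame (hypothetical name); values come from rows 10-12 shown earlier.
demo = pd.DataFrame({
    'fixed acidity': [6.7, 7.5, 5.6],
    'volatile acidity': [0.58, 0.50, 0.615],
    'chlorides': [0.097, 0.071, 0.089],
})

# Row labels with column positions: turn the positions into labels, then use .loc.
by_col_position = demo.loc[0:1, demo.columns[[0, 2]]]

# Row positions with column labels: turn the labels into positions, then use .iloc.
col_positions = demo.columns.get_indexer(['chlorides', 'fixed acidity'])
by_row_position = demo.iloc[0:2, col_positions]

print(by_col_position.columns.tolist())
print(by_row_position.columns.tolist())
```

This is a little more typing than `ix[]`, but it makes explicit whether each axis is being addressed by label or by position.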


##### Data Grabbing Questions Part 2

1. Using `.loc`, how would we grab the `pH` value at index `10`?
2. Using `.iloc`, how would we grab the values at indices `10` through `15` from the `5th` column?
3. How would we use `.ix` to answer questions `1` and `2`?