# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Intro to Pandas 2
Week 2 | Day 2

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Use the .loc indexer to slice
- Set and reset DataFrame Indices
- Use .isin()
- Perform boolean indexing on dataframes
- Perform math functions on pandas Series

## Recap of yesterday

### Importing pandas:
```python
import pandas as pd
```

### Reading in a csv: 

```python
df = pd.read_csv('path/to/file.csv')  # placeholder path; read_csv needs a file path or buffer
```

### Viewing the head and tail

```python
df.head()
df.tail()
```

### Viewing the columns and the index
```python
df.columns
df.index
```

### Data type info
```python
df.info()
```

### Summary statistics
```python
df.describe()
df['some_column_name'].max()
df['some_column_name'].min()
```

### Uniques
```python
df['some_column_name'].nunique()
df['some_column_name'].unique()
```

### Creating a histogram
```python
df['some_column_name'].hist()
```

## Slicing with .iloc

In [256]:
# import pandas
import pandas as pd

# create a dictionary
numbs = {'ones': [1, 2, 3, 4], 'tens': [10, 20, 30, 40], 'hundos': [100, 200, 300, 400]}

# pass dictionary into DataFrame and set the columns to keep the order
df = pd.DataFrame(numbs, columns=['ones', 'tens', 'hundos'])

In [257]:
df

|   | ones | tens | hundos |
|---|------|------|--------|
| 0 | 1 | 10 | 100 |
| 1 | 2 | 20 | 200 |
| 2 | 3 | 30 | 300 |
| 3 | 4 | 40 | 400 |


In [258]:
df.columns

Index([u'ones', u'tens', u'hundos'], dtype='object')

### Slicing with .iloc


In [259]:
df.iloc[:3, 0:2]

|   | ones | tens |
|---|------|------|
| 0 | 1 | 10 |
| 1 | 2 | 20 |
| 2 | 3 | 30 |


### How do we get the first column?

In [260]:
# solution

x = df.iloc[:, 0]

In [261]:
x

0    1
1    2
2    3
3    4
Name: ones, dtype: int64

In [262]:
df.iloc[:, 0:1]

|   | ones |
|---|------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |


In [263]:
df.iloc[:, 0].to_frame()

|   | ones |
|---|------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |


### How do we turn it into a DataFrame?

In [264]:
tdf = pd.DataFrame(x)

In [265]:
tdf

|   | ones |
|---|------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
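The cells above return two different types depending on how the column is requested. A minimal sketch of the difference, rebuilding the same toy frame so it runs on its own:

```python
import pandas as pd

# The same toy frame used in this lesson
df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

as_series = df.iloc[:, 0]             # a single integer position returns a Series
as_frame_slice = df.iloc[:, 0:1]      # a slice keeps the DataFrame shape
as_frame_list = df.iloc[:, [0]]       # so does a one-element list
as_frame_conv = as_series.to_frame()  # or convert the Series afterwards

for obj in (as_series, as_frame_slice, as_frame_list, as_frame_conv):
    print(type(obj).__name__, obj.shape)
```

Which form to prefer is a judgment call: a Series is handy for math, while a one-column DataFrame keeps the column header when displayed.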


## How do we get the first 3 rows and 2 columns?

In [266]:
df.iloc[:3, :2]

|   | ones | tens |
|---|------|------|
| 0 | 1 | 10 |
| 1 | 2 | 20 |
| 2 | 3 | 30 |


### Notice the stop position is exclusive: we get rows 0, 1, 2 and columns 0, 1
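A short sketch of that rule, assuming the same toy frame as above. `.iloc` slices behave like ordinary Python list slices: the start is included and the stop is left out.

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

# Row positions 0, 1, 2 and column positions 0, 1; position 3 and position 2 are excluded
subset = df.iloc[:3, :2]
print(subset.shape)

# A plain Python list follows the same convention
first_three = [10, 20, 30, 40][:3]
print(first_three)
```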

### One more. How do we get all rows and columns 0 and 2?

In [267]:
df.iloc[:, [0,2]]

|   | ones | hundos |
|---|------|--------|
| 0 | 1 | 100 |
| 1 | 2 | 200 |
| 2 | 3 | 300 |
| 3 | 4 | 400 |


### Can we get only columns with an 'o' in them with iloc?

In [268]:
df.iloc[:, [x for x in df.columns if 'o' in x]]

TypeError: cannot perform reduce with flexible type

## No.
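`.iloc` only understands integer positions, so a list of column names fails. If you did want to stay with `.iloc`, one possible workaround (a sketch, not the approach this lesson takes) is to translate the names into positions first with `Index.get_indexer`:

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

# Column names containing an 'o'
o_names = [c for c in df.columns if 'o' in c]

# Convert the names to integer positions, which .iloc accepts
o_positions = df.columns.get_indexer(o_names)
o_cols = df.iloc[:, o_positions]
print(o_cols)
```

This works, but it is more roundabout than selecting by label directly.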

## Introducing .loc

In [None]:
df.loc[:, [x for x in df.columns if 'o' in x]]

In [None]:
y = df.loc[:, [x for x in df.columns if 'o' in x]]
y

### .loc selects by label: here the row labels happen to be the integers 0-3 and the column labels are names

In [None]:
df.loc[:3, ['ones', 'tens']]

### Notice that it is inclusive! It includes the last item listed
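To make the contrast concrete, here is a small side-by-side sketch on the same toy frame. The slice `:3` means "up to and including the label 3" for `.loc`, but "up to and excluding position 3" for `.iloc`.

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

by_label = df.loc[:3, ['ones', 'tens']]  # labels 0 through 3, inclusive
by_position = df.iloc[:3, :2]            # positions 0 through 2

print(len(by_label), len(by_position))
```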

### Notice we can't use .iloc-style integer positions for columns: .loc looks for a column labeled 1 and raises a KeyError

In [None]:
df.loc[:2, 1]
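A hedged sketch of what happens in that cell: the columns are labeled with strings, so `.loc` has no column called `1`. Catching the error and then asking for the column by name shows the fix.

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]})

raised = False
try:
    df.loc[:2, 1]          # there is no column with the label 1
except KeyError as err:
    raised = True
    print('KeyError:', err)

# Asking for the column by its label works
col = df.loc[:2, 'tens']
print(col.tolist())
```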

## Let's change the index

In [None]:
from string import ascii_lowercase

# change the index to be lowercase letters, one per row
df.index = list(ascii_lowercase[:len(df.index)])

In [None]:
df

In [None]:
len(df.index)

In [None]:
df.index

## Now we'll use .loc to slice with a named index

In [None]:
# list of index labels
df.loc[['a','c'], :]

In [None]:
# slicing with named indices
df.loc['b':, :]
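Label slices work on columns as well as rows, and both ends are included. A minimal sketch, rebuilding the lettered frame so the example stands alone:

```python
import pandas as pd

df = pd.DataFrame({'ones': [1, 2, 3, 4],
                   'tens': [10, 20, 30, 40],
                   'hundos': [100, 200, 300, 400]},
                  index=list('abcd'))

tail = df.loc['b':, :]                   # rows b, c, d
middle = df.loc['b':'c', 'ones':'tens']  # rows b-c and columns ones-tens, ends included

print(tail)
print(middle)
```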

In [None]:
states = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}

In [None]:
# build the DataFrame column by column; list() turns the dict views into plain lists on Python 3
state_df = pd.DataFrame({
    'abbreviation': list(states.keys()),
    'name': list(states.values()),
    'name_length': [len(x) for x in states.values()]
})

## DataFrame of state names

In [None]:
# we have the index as what currently?
state_df

## Let's change the index

We have numbers as the index. Let's make the index the state's abbreviation.

In [None]:
state_df.set_index('abbreviation')

## Notice that returned a new DataFrame; it did not change state_df

In [None]:
state_df

## Have to save as a new DataFrame or use inplace=True

In [None]:
state_df.set_index('abbreviation', inplace=True)

## Now we can see the changes 'stuck'

In [None]:
state_df
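The two options can be sketched on a tiny frame. The name `small` and its three rows are made up for illustration; the behavior is the same on `state_df`.

```python
import pandas as pd

# Hypothetical three-row stand-in for state_df
small = pd.DataFrame({'abbreviation': ['NY', 'OH', 'TX'],
                      'name': ['New York', 'Ohio', 'Texas']})

# Option 1: keep the result under a new name; small itself is untouched
indexed = small.set_index('abbreviation')
print(list(small.columns))
print(list(indexed.index))

# Option 2: modify small directly; with inplace=True the call returns None
result = small.set_index('abbreviation', inplace=True)
print(result)
print(list(small.index))
```

Assigning to a new name is often easier to follow, since the original stays available if something goes wrong.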

## What if we want to go back?

In [None]:
state_df

## Need to reset it!

In [None]:
state_df.reset_index(inplace=True)

In [None]:
state_df
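`reset_index` has two useful modes, sketched here on a made-up three-row frame (`small` is a hypothetical stand-in for `state_df`). By default the old index is kept as a regular column; with `drop=True` it is discarded.

```python
import pandas as pd

# Hypothetical frame indexed by abbreviation
small = pd.DataFrame({'name': ['New York', 'Ohio', 'Texas'],
                      'name_length': [8, 4, 5]},
                     index=['NY', 'OH', 'TX'])
small.index.name = 'abbreviation'

kept = small.reset_index()              # old index becomes an 'abbreviation' column
dropped = small.reset_index(drop=True)  # old index is thrown away

print(list(kept.columns))
print(list(dropped.columns))
```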

## Exercise

Using state_df:
- set the index to the state name and save it as new_state_df
- use the .loc method to select all the states that begin with the letter 'N'
- reset the index back to a zero-based index and do so inplace

In [None]:
state_df.set_index('name', inplace=True)

In [None]:
state_df.iloc[1,1]

In [None]:
state_df.loc['Wisconsin', 'name_length']

In [None]:
state_df.loc['Washington':'Wisconsin', 'name_length']

In [None]:
state_df.loc['Washington':'Nevada', 'name_length']

In [None]:
state_df['name_length'][0:14]

In [None]:
state_df.reset_index(inplace=True)

In [None]:
new_state_df = state_df.set_index('name')
n_states = [x for x in new_state_df.index if x.startswith('N')]
n_state_df = new_state_df.loc[n_states, :]
n_state_df

In [None]:
state_df.index

In [None]:
# the index is already zero-based here; drop=True avoids adding an extra 'index' column
state_df.reset_index(drop=True, inplace=True)

In [None]:
state_df.index

In [None]:
# .iloc is indexed with square brackets, not called with parentheses
state_df.iloc[1, 2]

## Using .isin()

In [None]:
state_df

## We get a boolean (True/False) Series back

In [None]:
states_with_direction = ['North Dakota', 'North Carolina',\
                         'South Carolina', 'South Dakota',\
                         'West Virginia']

state_df['name'].isin(states_with_direction)

## Wrapping that in state_df[ ] gives us only the True rows

In [None]:
state_df[state_df['name'].isin(states_with_direction)]
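The two steps (build a boolean mask, then filter with it) can be sketched separately on a small made-up frame. The name `small` and its rows are illustrative only. The `~` operator flips the mask to select everything not in the list.

```python
import pandas as pd

# Hypothetical stand-in for state_df
small = pd.DataFrame({'name': ['North Dakota', 'Ohio', 'South Dakota', 'Texas']})

wanted = ['North Dakota', 'South Dakota']

mask = small['name'].isin(wanted)  # boolean Series, one value per row
selected = small[mask]             # rows where the mask is True
others = small[~mask]              # rows where the mask is False

print(mask.tolist())
print(selected)
print(others)
```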

In [None]:
state_df[state_df['name'].isin(states_with_direction)].reset_index()

In [None]:
# drop=True discards the old index instead of adding it back as a column
state_df[state_df['name'].isin(states_with_direction)].reset_index(drop=True)

## Exercise

Using the state_df DataFrame:
- use .isin() to select only those rows that have a name_length of 10 or 12 characters
- use another .isin() with a list comprehension to select only the rows whose abbreviation begins with an 'N' or an 'S'

In [None]:
n = [10,12]
state_df[state_df['name_length'].isin(n)]

In [None]:
state_df[state_df['name_length'].isin([10,12])]

In [None]:
ns = ['N', 'S']
state_df[state_df['abbreviation'].isin([x for x in state_df['abbreviation'] if x[0] in ns])]

In [None]:
state_df[state_df['abbreviation'].isin([x for x in state_df['abbreviation'] if x[0] == 'N' or x[0] == 'S'])]

## Further into Boolean indexing

In [None]:
state_df['name_length'] > 10

In [None]:
state_df[state_df['name_length'].isin([x for x in state_df['name_length'] if x > 10])]

In [None]:
new_state_df = state_df[state_df['name_length'].isin([x for x in state_df['name_length'] if x > 10])]

In [None]:
new_state_df

## Again, we can wrap that

In [None]:
# gives us only the rows that have a name length greater than 10
state_df[state_df['name_length']>10]

## Another example

In [None]:
state_df[state_df['name'].str.contains('South')]

## Let's use an 'and' here to combine requirements

In [None]:
state_df[(state_df['name_length']>12)\
          &(state_df['name'].str.contains('South'))]

## Let's use an 'or' statement

In [None]:
state_df[(state_df['name'].str.contains('North'))\
          |(state_df['name'].str.contains('South'))]

In [None]:
state_df[((state_df['name'].str.contains('North'))\
          |(state_df['name'].str.contains('South'))) & (state_df['name_length'] > 12)]
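A sketch of why the parentheses and the `&` / `|` operators matter, using a small made-up frame (`small` is hypothetical). Naming each mask first can make combined conditions easier to read, and Python's `and` keyword does not work on a whole Series.

```python
import pandas as pd

# Hypothetical stand-in for state_df
small = pd.DataFrame({'name': ['North Dakota', 'South Carolina', 'South Dakota',
                               'Ohio', 'North Carolina']})
small['name_length'] = small['name'].str.len()

long_name = small['name_length'] > 12
has_north = small['name'].str.contains('North')
has_south = small['name'].str.contains('South')

both = small[long_name & has_south]                    # 'and': both masks True
either = small[has_north | has_south]                  # 'or': at least one mask True
combined = small[(has_north | has_south) & long_name]  # parentheses control grouping

# The keyword 'and' tries to reduce each Series to a single True/False and fails
keyword_failed = False
try:
    small[long_name and has_south]
except ValueError:
    keyword_failed = True

print(both['name'].tolist())
print(either['name'].tolist())
print(combined['name'].tolist())
```

When conditions are written inline rather than named, each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `<=`.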

## Exercise

Using the state_df DataFrame:
- use Boolean indexing to select all states with a y in their name
- using the same code from the line above add another condition to only return states that have 10 or fewer characters in their name

In [None]:
state_df[state_df['name'].str.contains('y')]

In [None]:
state_df[(state_df['name'].str.contains('y')) & (state_df['name_length'] <= 10)]

## Math with pandas columns

In [None]:
state_df

## Let's add a new column that is 100x name_length

In [None]:
tmp_df = state_df.copy()

tmp_df['name_length_x100'] = tmp_df['name_length'] * 100

tmp_df

## Let's add two columns together

In [None]:
tmp_df['name_added_cols'] = tmp_df['name_length'] + tmp_df['name_length_x100']

tmp_df
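Arithmetic on a column is applied to every row at once, with no loop needed. A minimal sketch on a made-up three-row frame (`tmp` and its rows are illustrative), using the same column names as the cells above:

```python
import pandas as pd

# Hypothetical three-row frame
tmp = pd.DataFrame({'name': ['Ohio', 'Texas', 'Utah']})
tmp['name_length'] = tmp['name'].str.len()

# Multiply every value in the column by a scalar
tmp['name_length_x100'] = tmp['name_length'] * 100

# Add two columns together row by row
tmp['name_added_cols'] = tmp['name_length'] + tmp['name_length_x100']

print(tmp)
```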

## Exercise

Using the state_df DataFrame again:
- Save temp_df as a copy of state_df
- Double the name_length column by adding it to itself
- Double the doubled column you created by multiplying it by 2

In [None]:
temp_df = state_df.copy()
temp_df

In [None]:
# double the column by adding it to itself, as the exercise asks
temp_df['double_name'] = temp_df['name_length'] + temp_df['name_length']
temp_df.iloc[0:3, :]

In [248]:
temp_df.index

RangeIndex(start=0, stop=50, step=1)

In [269]:
# drop the row labeled 0; this returns a new DataFrame and leaves temp_df unchanged
temp_df.drop(0)

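|   | abbreviation | name | name_length | double_name |
|---|--------------|------|-------------|-------------|
| 0 | WA | Washington | 10 | 20 |
| 1 | WI | Wisconsin | 9 | 18 |
| 2 | WV | West Virginia | 13 | 26 |
| 3 | FL | Florida | 7 | 14 |
| 4 | WY | Wyoming | 7 | 14 |
| 5 | NH | New Hampshire | 13 | 26 |
| 6 | NJ | New Jersey | 10 | 20 |
| 7 | NM | New Mexico | 10 | 20 |
| 8 | NC | North Carolina | 14 | 28 |
| 9 | ND | North Dakota | 12 | 24 |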


In [254]:
temp_df.drop(['double_name'], axis=1)


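|   | abbreviation | name | name_length |
|---|--------------|------|-------------|
| 0 | WA | Washington | 10 |
| 1 | WI | Wisconsin | 9 |
| 2 | WV | West Virginia | 13 |
| 3 | FL | Florida | 7 |
| 4 | WY | Wyoming | 7 |
| 5 | NH | New Hampshire | 13 |
| 6 | NJ | New Jersey | 10 |
| 7 | NM | New Mexico | 10 |
| 8 | NC | North Carolina | 14 |
| 9 | ND | North Dakota | 12 |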
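The two `drop` calls above remove a row and a column respectively, and neither changes the original frame. A small sketch on made-up data (`tmp` is hypothetical) to show both, plus the equivalent `columns=` keyword:

```python
import pandas as pd

# Hypothetical three-row frame
tmp = pd.DataFrame({'name': ['Ohio', 'Texas', 'Utah'],
                    'name_length': [4, 5, 4],
                    'double_name': [8, 10, 8]})

without_row = tmp.drop(0)                          # drop the row labeled 0
without_col = tmp.drop(['double_name'], axis=1)    # drop a column by name
same_without_col = tmp.drop(columns=['double_name'])  # equivalent keyword form

print(list(without_row.index))
print(list(without_col.columns))
print(tmp.shape)  # the original still has every row and column
```

To keep the result, assign it to a name, for example `tmp = tmp.drop(0)`.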


## Conclusion

In this lecture we covered:
- How to use the .loc indexer to slice and how it differs from .iloc
- How to set and reset DataFrame Indices
- How to use .isin()
- How to perform boolean indexing on dataframes
- How to combine conditions using '|' and '&'
- How to perform math on pandas Series