# 01.24.22 Week 2, Lecture 1: Pandas 🐼!

- 01.24.22 Cohort
- 02/01/22

# Loading Data with Python

## `pd.read_csv()` and `pd.read_excel()`


- Different file types need different functions to load them:
    - .csv files → `pd.read_csv('filepath')`
    - .xlsx files → `pd.read_excel('filepath')`


- `'filepath'` can be a remote URL as long as it points directly to a .csv or .xlsx file.
- Other functions exist to load other file types, but these are the ones we will use.
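
A minimal sketch of the same call pattern that runs without a network connection. It assumes a tiny inline CSV (the `csv_text` string and its rows are made up for illustration); `pd.read_csv` accepts a local path, a URL, or a file-like object such as `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file or URL
csv_text = """EmployeeID,Name,Height
1,Tom,68
2,Jerry,72
3,Ann,69
"""

# read_csv takes a path, a URL, or a file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.shape)           # (number of rows, number of columns)
print(list(df_demo.columns))   # column names taken from the first line
```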


In [None]:
import pandas as pd
file_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS7TaxsUixSyoL0Rn8LPfbWIjeTd2-QdoZ0B2Knk14XYEmUzHUL-UhMilWK34Fn9dGjTcuo0-teSLU2/pub?output=csv"


In [None]:
df = pd.read_csv(file_url)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Headers 

### Headers as the top row (default)

- The header is the top row (row 0).
    - This is the default value, so these two calls are equivalent (the same applies to `pd.read_csv`):
        - `pd.read_excel('path')`
        - `pd.read_excel('path', header=0)`


In [None]:
## equivalent to cell below
df = pd.read_csv(file_url)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
## equivalent to cell above
df = pd.read_csv(file_url,header=0)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### NO headers
`pd.read_excel('path', header=None)`

- The columns are then numbered 0, 1, 2, ... and the first line of the file is treated as data.

In [None]:
df = pd.read_csv(file_url,header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
3,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
...,...,...,...,...,...,...,...,...,...,...,...,...
887,887,0,2,"Montvila, Rev. Juozas",male,27,0,0,211536,13,,S
888,888,1,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30,B42,S
889,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
890,890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,111369,30,C148,C


### Header present, but not at the top

- Pass the 0-based position of the header row with `header=`; any rows above it are skipped.

In [None]:
file_url_headers = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS9gr3YlHKDVCohqfXbeJ4e6Oxg97qc9cs-JKlFDkzBV9_32ZtRLqAhCNWBpJvm2mEYtW3xhOzkfQ-u/pub?output=csv"
df = pd.read_csv(file_url_headers)
df

Unnamed: 0.1,Unnamed: 0,Example CSV with the headers not in the first row,Unnamed: 2,Unnamed: 3,Unnamed: 4,Fname = titanic - added headers,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,,,,,,,,,,,,
1,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
3,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
4,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
888,887,0,2,"Montvila, Rev. Juozas",male,27,0,0,211536,13,,S
889,888,1,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30,B42,S
890,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
891,890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,111369,30,C148,C


In [None]:
df = pd.read_csv(file_url_headers,header=2)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
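
The three `header` settings can be compared side by side on a small inline file. This is only a sketch: the `csv_text` contents (a title row, an empty row, then the real header) are invented to mimic the layout of the example file above.

```python
import io
import pandas as pd

# Hypothetical file: a title row, an empty row, then the real header
csv_text = """Example title,,
,,
Name,Height,Age
Tom,68,33
Jerry,72,27
"""

# header=0 (default): the title row is used as the header
default = pd.read_csv(io.StringIO(csv_text))
# header=None: no header row, columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
# header=2: the line at position 2 is the header, lines above it are dropped
fixed = pd.read_csv(io.StringIO(csv_text), header=2)

print(list(default.columns))
print(list(no_header.columns))
print(list(fixed.columns), len(fixed))
```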


## Quick Look at top and bottom of dataframe

- To see the top 5 rows: `df.head()`
- To see a different number of rows, such as the first 8 rows: `df.head(8)`

- To see the bottom 5 rows: `df.tail()`

- To see a different number of rows, such as the last 11 rows: `df.tail(11)`


- To see 5 random rows: `df.sample(5)`
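
A short sketch of the three methods on a made-up 20-row frame (`demo` is hypothetical). The `random_state` argument is an addition not used in the lecture cells; it makes `sample()` return the same rows on every run.

```python
import pandas as pd

# Hypothetical 20-row frame for illustration
demo = pd.DataFrame({"n": range(20)})

print(demo.head(3))   # first 3 rows
print(demo.tail(2))   # last 2 rows

# sample() draws different rows on each run unless random_state is fixed
pick_a = demo.sample(5, random_state=42)
pick_b = demo.sample(5, random_state=42)
print(pick_a.equals(pick_b))   # the two draws contain the same rows
```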
		


In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
#This will give you the top 5 rows: 
df.head() 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
## top 8 rows
df.head(8)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [None]:
# bottom 5
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [None]:
# bottom 11
df.tail(11)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0,,S
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C


In [None]:
## to see 5 random rows
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
611,612,0,3,"Jardin, Mr. Jose Neto",male,,0,0,SOTON/O.Q. 3101305,7.05,,S
652,653,0,3,"Kalvik, Mr. Johannes Halvorsen",male,21.0,0,0,8475,8.4333,,S
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0,,S


## Checking data types


- To get lots of info, including the data types, index, and column names: `df.info()`


- To get JUST the data type for all columns: `df.dtypes`


- To get the data type for just one column: `df['name'].dtype`
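
A sketch of how these differ in what they return, using a made-up three-column frame (`demo` is hypothetical). `select_dtypes` is not covered in this lecture; it is included as one possible way to list columns by type.

```python
import pandas as pd

# Hypothetical frame with text, integer, and float columns
demo = pd.DataFrame({
    "Name": ["Tom", "Jerry"],
    "Age": [33, 27],
    "Height": [68.5, 72.0],
})

result = demo.info()        # prints the summary to the screen...
print(result is None)       # ...but returns None, so there is nothing to save

dtypes = demo.dtypes        # a Series mapping column name -> dtype
print(dtypes["Age"])        # look up one column in that Series
print(demo["Age"].dtype)    # or ask the column directly

# Names of all numeric columns
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```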



In [None]:
# lots of info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
save_info = df.info()
type(save_info)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


NoneType

In [None]:

save_info

In [None]:
# just data types
saved_dtypes = df.dtypes


In [None]:
saved_dtypes[saved_dtypes == 'object']


Name        object
Sex         object
Ticket      object
Cabin       object
Embarked    object
dtype: object

In [None]:
# df['Age'].astype(int)

In [None]:
## dtype for 1 col
df['Survived'].dtype

dtype('int64')

## How many rows and columns?


- To get lots of info, including the number of rows and columns: `df.info()`


- To get the shape as (number of rows, number of columns): `df.shape`


- To get the number of rows: `len(df)`


- To get the number of columns: `len(df.columns)`
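
Since `df.shape` is a plain tuple, it can be unpacked into two variables. A small sketch on a made-up frame (`demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is an attribute, not a method, so there are no parentheses
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# The two len() calls agree with the tuple
print(len(demo) == n_rows, len(demo.columns) == n_cols)
```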
			
										



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
df.shape

(891, 12)

In [None]:
len(df)

891

In [None]:
len(df.columns)

12

## Slicing Dataframes

1. To select 1 column as a pandas series: `df['name']`

2. To select 1 column as a pandas dataframe: `df[['name']]`

3. To select multiple columns as a dataframe: `df[['name', 'Manufacturer']]`
  
>  - Notice the double brackets `[[` `]]` in 2 and 3
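
A quick way to see the difference is to check the type and shape of each result. The sketch below uses a made-up frame (`demo` and its columns are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "Name": ["Tom", "Jerry"],
    "Age": [33, 27],
    "State": ["CA", "NY"],
})

one_as_series = demo["Age"]          # single brackets -> Series
one_as_frame = demo[["Age"]]         # list with one name -> DataFrame
several = demo[["Name", "State"]]    # list of names -> DataFrame

print(type(one_as_series).__name__, one_as_series.shape)
print(type(one_as_frame).__name__, one_as_frame.shape)
print(type(several).__name__, several.shape)
```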


In [None]:
## 1 column as a series 
df['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [None]:
## 1 column as a dataframe, then count how often each value occurs
df[['Age']].value_counts()

Age  
24.00    30
22.00    27
18.00    26
30.00    25
28.00    25
         ..
20.50     1
14.50     1
12.00     1
0.92      1
80.00     1
Length: 88, dtype: int64

In [None]:
## multiple columns as a dataframe, then count each unique combination
df[['Age','Survived',"Sex"]].value_counts()

Age   Survived  Sex   
28.0  0         male      16
19.0  0         male      16
21.0  0         male      16
25.0  0         male      14
24.0  1         female    14
                          ..
                male       1
23.5  0         male       1
23.0  1         male       1
      0         female     1
80.0  1         male       1
Length: 216, dtype: int64

# Accessing and Changing Data


- [Share URL for Today's Data](https://docs.google.com/spreadsheets/d/1kqEJtfsfZGY0idcgkH-2q3RitZlfGRDG1PBu_cLLGjI/edit?usp=sharing)

## Index

- When you read in a dataset, pandas will assign an index value to each row in the dataset
 
    - Remember Python uses 0 indexing, so the first row of data will be assigned 0.

- You can leave this alone, but sometimes we may want to change our index to correspond with a column of our data

- For example, if our data contains a column such as Employee_ID, that might be a logical column to use as the index

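- Pandas does not require index values to be unique, but an index is most useful when they are: with duplicate labels, a lookup such as `df.loc[label]` returns several rows instead of one.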
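
One way to check a column before using it as the index is sketched below on a made-up frame (`demo` is hypothetical). `is_unique` and the `verify_integrity` argument of `set_index` are not used elsewhere in this lecture.

```python
import pandas as pd

demo = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "State": ["CA", "NY", "CA"],
})

print(demo["EmployeeID"].is_unique)   # no repeated IDs
print(demo["State"].is_unique)        # "CA" appears twice

by_id = demo.set_index("EmployeeID")

# verify_integrity=True asks pandas to raise if the new index has duplicates
try:
    demo.set_index("State", verify_integrity=True)
    duplicates_rejected = False
except ValueError:
    duplicates_rejected = True
print(duplicates_rejected)
```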




In [None]:
file_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSBBVejdY87e_8cMG9xUEu2Q2DiQC1rHBgmq9ecR8kduFcdsae_Z1B12303buUISaPhHWopT9-Howzi/pub?output=csv#"
# file_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS7TaxsUixSyoL0Rn8LPfbWIjeTd2-QdoZ0B2Knk14XYEmUzHUL-UhMilWK34Fn9dGjTcuo0-teSLU2/pub?output=csv"
df = pd.read_csv(file_url)

display(df.head() )
df.index

Unnamed: 0,EmployeeID,Name,Height,Age,State,Role
0,1,Tom,68,33,CA,Senior Advisor
1,2,Jerry,72,27,NY,Junior Advisor
2,3,Ann,69,42,MD,Senior Agent
3,4,Hubert,68,35,WA,Junior Agent
4,5,Monica,63,64,TX,Senior Agent


RangeIndex(start=0, stop=11, step=1)

### Setting the index


#### Using `index_col` in `pd.read_csv`/`pd.read_excel`

- By specifying the `index_col` argument for `pd.read_csv` and `pd.read_excel`, you can tell pandas which column should automatically be loaded as the index.

In [None]:
# can set the index column using its position (0 = the first column)
df = pd.read_csv(file_url,index_col=0)
display(df.head())
df.index

Unnamed: 0_level_0,Name,Height,Age,State,Role
EmployeeID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Tom,68,33,CA,Senior Advisor
2,Jerry,72,27,NY,Junior Advisor
3,Ann,69,42,MD,Senior Agent
4,Hubert,68,35,WA,Junior Agent
5,Monica,63,64,TX,Senior Agent


Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64', name='EmployeeID')

#### Using `df.set_index()`

In [None]:
## or use .set_index() after importing
df1 = pd.read_csv(file_url)
df1


Unnamed: 0,EmployeeID,Name,Height,Age,State,Role
0,1,Tom,68,33,CA,Senior Advisor
1,2,Jerry,72,27,NY,Junior Advisor
2,3,Ann,69,42,MD,Senior Agent
3,4,Hubert,68,35,WA,Junior Agent
4,5,Monica,63,64,TX,Senior Agent
5,6,Suzy,65,15,MA,Juior Intern
6,7,Greg,70,45,CA,Senior Accountant
7,8,Bob,71,52,NY,Senior Agent
8,9,Nora,66,27,MD,PR Rep - Junior
9,10,Vadim,72,38,TX,PR Rep - Senior


In [None]:
df1.set_index('EmployeeID')

Unnamed: 0_level_0,Name,Height,Age,State,Role
EmployeeID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Tom,68,33,CA,Senior Advisor
2,Jerry,72,27,NY,Junior Advisor
3,Ann,69,42,MD,Senior Agent
4,Hubert,68,35,WA,Junior Agent
5,Monica,63,64,TX,Senior Agent
6,Suzy,65,15,MA,Juior Intern
7,Greg,70,45,CA,Senior Accountant
8,Bob,71,52,NY,Senior Agent
9,Nora,66,27,MD,PR Rep - Junior
10,Vadim,72,38,TX,PR Rep - Senior


In [None]:

df1 = df1.set_index('EmployeeID')


##### `inplace=True` or not?

- Pandas, by default, does not change a dataframe when running an operation like `df.set_index()`
    - Instead, it returns a NEW dataframe with the changes. 
- To save the changes, you can either:
    - Overwrite the original variable:
        - `df = df.set_index("PassengerId")`
    - OR add `inplace=True`
        - `df.set_index("PassengerId", inplace=True)`
    - BUT NEVER DO BOTH: with `inplace=True` the method returns `None`, SO YOU WILL WIND UP WITH `df` SET TO `None`!
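
The two options, and the reason for not combining them, can be sketched on a made-up two-row frame (`demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"EmployeeID": [1, 2], "Name": ["Tom", "Jerry"]})

# Option A: reassign the frame that set_index returns
a = demo.set_index("EmployeeID")

# Option B: modify a copy in place; the call itself returns None
b = demo.copy()
returned = b.set_index("EmployeeID", inplace=True)

print(a.equals(b))        # both routes produce the same frame
print(returned is None)   # assigning this return value would overwrite the frame with None
```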

In [None]:
## using inplace=True
df2 = pd.read_csv(file_url)
df2.set_index('EmployeeID',inplace=True)
df2

In [None]:
## DO NOT USE BOTH METHODS TOGETHER!!
df = pd.read_csv(file_url)
df = df.set_index('EmployeeID',inplace=True)
df

In [None]:
## this would raise an AttributeError: df is now None, and None has no .info() method
# df.info()

In [None]:
### Final correct dataframe
df = pd.read_csv(file_url, index_col=0)
df


# Slicing with Pandas Locators: `df.loc[]` and `df.iloc[]`



- Data in dataframes can be accessed directly using the powerful indexing tools, `.iloc` and `.loc`.
- These are used like slicing, with square brackets, not like methods with parentheses.

- They can be thought of like latitude and longitude: (row, column)

- They both take arguments in the same format. `df.loc[row, column]` and `df.iloc[row, column]`

- **A row** is one **observation** and **a column** is one variable or **feature**.

- However, they differ slightly in how you reference the row and column.




## Integer-based indexing `df.iloc`: 

- If you know the position of the rows/columns you want, you can use `iloc`, which ignores the actual index values and column names. 
    - It instead uses the position you specify (like slicing from a list).
- It does not accept index labels or column names; you pass integer positions (a single integer, a list of integers, or a slice).

### `iloc` Examples:


In [None]:
# return the entire 8th row as a series - with just row #
df.iloc[7]

In [None]:
# return the entire 8th row as a series - with row # plus slicing all columns
df.iloc[7,:]

In [None]:
#  Return the 8th through 10th rows as a dataframe.
df.iloc[7:10]

In [None]:
# return the 2nd column of the 8th row (just the VALUE in the cell)
df.iloc[7,1]

In [None]:
# return the 2nd through 5th columns of the 8th row (as a series)
df.iloc[7,1:5]

## General Locator `df.loc`

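- Takes labels, which can be strings or integers (whatever the current index values and column names of the dataframe are).
- Uses the actual values of the index and the column names, not their positions.
- Unlike `iloc`, a `.loc` slice such as `'Jerry':'Hubert'` includes BOTH endpoints.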
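
A sketch contrasting the two locators on a made-up frame indexed by name (`demo` and its rows are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame(
    {"Height": [68, 72, 69, 68], "State": ["CA", "NY", "MD", "WA"]},
    index=["Tom", "Jerry", "Ann", "Hubert"],
)

# iloc: positions 1 and 2; the stop position is excluded, as with list slicing
by_position = demo.iloc[1:3]
# loc: label slices include both the start and the stop label
by_label = demo.loc["Jerry":"Hubert"]

print(list(by_position.index))
print(list(by_label.index))

# The same cell reached by position and by label
print(demo.iloc[1, 0], demo.loc["Jerry", "Height"])
```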

In [None]:
## Reload our df with the Names as the index
df = pd.read_csv(file_url, index_col=1)
df.head()

#### Example: Looking Up Jerry's Height

In [None]:
df.loc['Jerry','Height']

In [None]:
df.loc['Jerry':"Hubert", 'Height']

#### Slicing Multiple Columns with `.loc`

In [None]:
## get the Height AND State for the rows from Jerry to Hubert
df.loc["Jerry":"Hubert", ["Height",'State']] 

### `.loc` Can Also be Used to **Change** Values

In [None]:
## change Jerry's height to 6' 5" (77 inches)
df.loc["Jerry",'Height'] = 77
df

In [None]:
## Change the Heights from 'Jerry' through 'Hubert' to the mean of all heights
df.loc['Jerry':'Hubert','Height'] = df['Height'].mean()
df


# Filtering


- Filtering is the process of selecting data based on particular criteria

- We often want to focus on a subset of our data.

> **You can use boolean expressions to filter rows by the value in a column.**


- You can select using `>`, `<`, `==`, or any other function or method that returns a boolean array.


### Simple Filtering Examples

In [None]:
### reloading example data
df = pd.read_csv(file_url,index_col='EmployeeID')
df

In [None]:
## select all columns for the rows whose Age is 15 - with just square brackets
df[df['Age'] == 15]

In [None]:
## select all columns for the rows whose Age is 15 - with .loc
df.loc[df['Age'] == 15,:]

In [None]:
# To select the names of people who are 15:
df.loc[df['Age'] == 15, 'Name']


## Filtering In-Depth: `boolean` Series/Indices

- Comparing a series to a value (a boolean expression) returns a pandas series of boolean values (True/False)
    - if we have a True or False value for every row in our dataframe, we can call that a **boolean index**
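
A minimal sketch of a boolean index on a made-up three-row frame (`demo` and `mask` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Tom", "Suzy", "Ann"], "Age": [33, 15, 42]})

mask = demo["Age"] > 15    # one True/False value per row
print(mask.tolist())
print(mask.sum())          # True counts as 1, so this is the number of matching rows
print(demo.loc[mask, "Name"].tolist())
```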



In [None]:
## what does df['Age'] > 15 return?
df['Age']>15

>- If we use this boolean series as the row selector in `.loc`, it will only return the rows where the value is `True`.
    - If we are ONLY filtering based on rows, we can skip the `.loc`

In [None]:
# return the rows for people over the age of 15 - with .loc
# (named age_filter so we don't overwrite Python's built-in filter())
age_filter = df['Age'] > 15
df.loc[age_filter]

In [None]:
# return the rows for people over the age of 15 - without .loc
age_filter = df['Age'] > 15
df[age_filter]

- We can also use filtering to **change** values
    - In this case, we MUST use `.loc`

In [None]:
## Change the Heights for all employees over 15 years old to 65 inches
df.loc[age_filter, 'Height'] = 65

In [None]:
df
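
The reason `.loc` is required here: chaining two selections, as in `df[mask]['Height'] = 65`, assigns to a temporary copy rather than to the original dataframe. A sketch of the single-call form on made-up data (`demo` and `adults` are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "Name": ["Tom", "Suzy", "Ann"],
    "Age": [33, 15, 42],
    "Height": [68, 65, 69],
})

adults = demo["Age"] > 15

# Row selector and column selector go in ONE .loc call,
# so the assignment lands in the original frame
demo.loc[adults, "Height"] = 70

print(demo["Height"].tolist())   # only the rows where the mask is True changed
```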

### Filtering with Boolean String Methods

- There are a lot of string methods and many return booleans!!
    - `str.endswith()`
    - `str.startswith()`

- We can use these string methods on a column in pandas if we add the **`.str` accessor**



#### Examples: Filtering with String Boolean Methods

- Each method returns a boolean series that can then be used as a filter. For example, for a dataframe with a `City` column:
    - `df['City'].str.endswith('i')` is True for cities that end in 'i', such as Miami
    - `df['City'].str.contains('or')` is True for cities that have 'or' in them, such as New York


In [None]:
# To evaluate if the name of a state starts with an "N"
starts_N = df['State'].str.startswith("N")
starts_N

In [None]:
## to evaluate if their role ends with the word intern
interns = df['Role'].str.endswith('Intern')
interns

In [None]:
## To evaluate if the employee is a Senior level employee
senior_level = df['Role'].str.contains('Senior')
senior_level
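
The boolean series above can be passed to `.loc` like any other filter. The sketch below uses made-up rows (the `demo` frame and the employee "Pat" with a missing role are hypothetical) and adds the `na=False` argument, which is not used in the lecture cells, so that missing text counts as "no match".

```python
import pandas as pd

demo = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Suzy", "Pat"],
    "Role": ["Senior Advisor", "Junior Advisor", "Junior Intern", None],
})

# A missing Role would otherwise produce a missing value in the mask;
# na=False treats it as False
is_senior = demo["Role"].str.contains("Senior", na=False)
is_intern = demo["Role"].str.endswith("Intern", na=False)

print(demo.loc[is_senior, "Name"].tolist())
print(demo.loc[is_intern, "Name"].tolist())
```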

## Combining Location and Min/Max

In [None]:
## reload original df
df = pd.read_csv(file_url,index_col=0)
df

- `.min()` gives us the minimum value of a column, but how do we find the rest of the row that holds this minimum value?

In [None]:
# Option 1: Assign a variable and use locator
min_val = df['Height'].min()
df.loc[ df['Height'] == min_val,:]

In [None]:
# Option 2: Combine into one line
df.loc[ df['Height'] == df['Height'].min()]
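
A possible third option, not covered in the lecture, is `idxmin()`, which returns the index label of the first minimum. A sketch on made-up rows (`demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame(
    {"Name": ["Tom", "Monica", "Ann"], "Height": [68, 63, 69]},
    index=[1, 5, 3],
)

# idxmin returns the index LABEL of the first minimum, ready to pass to .loc
shortest_label = demo["Height"].idxmin()
print(shortest_label)
print(demo.loc[shortest_label, "Name"])

# The boolean filter returns every row tied for the minimum; idxmin only the first
tied = demo.loc[demo["Height"] == demo["Height"].min(), "Name"].tolist()
print(tied)
```

When several rows share the minimum, the boolean filter from Options 1 and 2 is the safer choice because it keeps all of them.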


# Multiple Filters (AKA Filtering on Multiple Conditions)

- To combine filters use:
    - `&` (and) if BOTH must be true
    - `|` (or) if EITHER must be true
    - `~` (negation) to take the OPPOSITE of true/false
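
The three operators can be sketched on a made-up four-row frame (`demo`, `in_ny`, and `over_30` are hypothetical names):

```python
import pandas as pd

demo = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Bob", "Suzy"],
    "State": ["CA", "NY", "NY", "MA"],
    "Age": [33, 27, 52, 15],
})

in_ny = demo["State"] == "NY"
over_30 = demo["Age"] > 30

both = demo.loc[in_ny & over_30, "Name"].tolist()     # both conditions true
either = demo.loc[in_ny | over_30, "Name"].tolist()   # at least one condition true
not_ny = demo.loc[~in_ny, "Name"].tolist()            # opposite of in_ny

print(both)
print(either)
print(not_ny)
```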

In [None]:
## first letter of each employee's name, sorted alphabetically
df['Name'].map(lambda x: x[0]).sort_values()

In [None]:
## Filter any employees from NY who are over 30 years old
filter_ny = df['State'] == "NY"
filter_age =  df['Age'] > 30
df.loc[filter_ny & filter_age]

- To combine multiple filters without saving each filter as a separate variable:
    - Put each logical filter inside of its own parentheses `( )`

In [None]:
## filter for ny and over 30 in 1 line
df.loc[(df['State'] == "NY") & (df['Age'] > 30)]
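
The parentheses matter because `&` binds more tightly than `==` and `>` in Python. A sketch on made-up data (`demo` is hypothetical) showing that the unparenthesized version fails instead of filtering:

```python
import pandas as pd

demo = pd.DataFrame({"State": ["CA", "NY", "NY"], "Age": [33, 27, 52]})

# With parentheses, each comparison is evaluated first, then the results are combined
ok = demo.loc[(demo["State"] == "NY") & (demo["Age"] > 30)]
print(len(ok))

# Without them, Python tries to evaluate "NY" & demo["Age"] first, which fails
try:
    demo.loc[demo["State"] == "NY" & demo["Age"] > 30]
    raised = False
except (TypeError, ValueError):
    raised = True
print(raised)
```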

### No rows? No Problem!

- Let's say we try to filter based on the 3 conditions below.

In [None]:
filter_ny = df['State'] == "NY"
filter_age =  df['Age'] > 30
filter_juniors = df['Role'].str.contains("Junior")
df.loc[filter_ny & filter_age & filter_juniors]

- We tried to combine 3 filters and we did not get any rows returned.
    - This is not an error!
- This means there are NO data points that meet our criteria. 
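
If later code depends on the filter finding something, the result can be checked with the `.empty` attribute (not used in the lecture cells). A sketch on made-up data (`demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"State": ["CA", "NY"], "Age": [33, 27]})

matches = demo.loc[(demo["State"] == "NY") & (demo["Age"] > 30)]

print(matches.empty)           # True when no rows matched
print(matches.shape)           # zero rows, but the columns are still there
print(list(matches.columns))
```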


# Challenge Activity

- City Population Challenge - Revisited:

- Make a copy of this [Colab Notebook](https://colab.research.google.com/drive/1hux10mJq7IxP9P5hZO3PpMxWvxlboI-i?usp=sharing)

- Continue on from header
“Exploring City Populations - Part 2 [NEW]” 



<!-- - Download the [Cereals Data Set](https://drive.google.com/file/d/1YzP0CF_stFjav6fkXeQ0J0gpk1PQoY2l/view)
(Notice this is an Excel file!)

- Make a copy of [this Colab Notebook](https://colab.research.google.com/drive/1DEn2RJO7YOyAurvSIHG3hYh_dTuFgA4Q?usp=sharing)

- Mount your drive

    - OR use the file_url included in the activity notebook to load the file from url.
 -->
