## Introduction

#### Our goals today are to be able to:

- Investigate table data in Pandas
- Manipulate Pandas DataFrames

In [3]:
# list comprehension

list_of_squares = []

for x in range(1,11):                   # standard for loop
    list_of_squares.append(x**2)
    
print(list_of_squares)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


In [4]:
list_of_squares = [x**2 for x in range(1,11)]  # the same loop written as a list comprehension
list_of_squares

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [5]:
list_of_squares = tuple(x**2 for x in range(1,11))  #tuple version
list_of_squares

(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

In [6]:
{x: x**2 for x in range(1,11)}   #dict version

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81, 10: 100}

In [7]:
# with an if statement

list_of_odd_squares = [x**2 for x in range(1,11) if x % 2 == 1]  # the if clause keeps only odd x; without it every x is kept
list_of_odd_squares

[1, 9, 25, 49, 81]

Beautiful is better than ugly. (The Zen of Python)

In [8]:
#nested list comprehension

# 2-D List 
matrix = [[1, 2, 3], [4, 5], [6, 7, 8, 9]] 
  
# Nested List Comprehension to flatten a given 2-D matrix 
flatten_matrix = [val for sublist in matrix for val in sublist] 
  
print(flatten_matrix) 


[1, 2, 3, 4, 5, 6, 7, 8, 9]
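
To read a nested comprehension, it may help to unroll it: the `for` clauses run left to right, in the same order as the equivalent nested loops. A minimal sketch reusing the `matrix` from the cell above (the names `flattened_loop` and `flattened_comp` are only illustrative):

```python
matrix = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Equivalent nested for loops: the outer loop comes first, as in the comprehension
flattened_loop = []
for sublist in matrix:
    for val in sublist:
        flattened_loop.append(val)

# The same result written as a nested list comprehension
flattened_comp = [val for sublist in matrix for val in sublist]

print(flattened_loop == flattened_comp)
```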


In [13]:
square_dict = {x:x**2 for x in range(1,11)}

In [14]:
square_dict.items()  # items() returns a view of (key, value) tuples

dict_items([(1, 1), (2, 4), (3, 9), (4, 16), (5, 25), (6, 36), (7, 49), (8, 64), (9, 81), (10, 100)])

In [16]:
square_dict

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81, 10: 100}

In [15]:
[val ** .5 for key, val in square_dict.items()]  # .items() for dict!

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

### Activation:

![excel2](img/excelpic2.jpg)

Most people have used Microsoft Excel or Google Sheets. But what are the limitations of Excel?

- [Take a minute to read this article](https://www.bbc.com/news/magazine-22223190)
- Make a list of problems Excel presents

How is using Python different?

Python
- create documentation of processes as you code
- reduces chances for human error
- do "drag and drop"
- repeatable
- transparent

## Pandas

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=600>  




- The data manipulation capabilities of Pandas are built on top of the numpy library.
- A Pandas DataFrame object represents a spreadsheet with cell values, column names, and row index labels.

### 1. Importing and reading data with Pandas!

#### Let's use pandas to read some csv files so we can interact with them.



In [1]:
# First, let's check which directory we are in so the files we expect to see are there.
!pwd
!ls -la

'pwd' is not recognized as an internal or external command,
operable program or batch file.
'ls' is not recognized as an internal or external command,
operable program or batch file.


In [17]:
pwd

'C:\\Users\\jongs\\Desktop\\DS_2020\\Course_Materials\\code\\hbs-ds-060120\\module-1\\day-5-pandas-1'

In [23]:
ls -la

 Volume in drive C has no label.
 Volume Serial Number is 06B9-E2F5

 Directory of C:\Users\jongs\Desktop\DS_2020\Course_Materials\code\hbs-ds-060120\module-1\day-5-pandas-1



File Not Found


In [25]:
ls -la data

 Volume in drive C has no label.
 Volume Serial Number is 06B9-E2F5

 Directory of C:\Users\jongs\Desktop\DS_2020\Course_Materials\code\hbs-ds-060120\module-1\day-5-pandas-1


 Directory of C:\Users\jongs\Desktop\DS_2020\Course_Materials\code\hbs-ds-060120\module-1\day-5-pandas-1\data

06/05/2020  07:35    <DIR>          .
06/05/2020  07:35    <DIR>          ..
06/05/2020  07:35                65 example1.csv
06/05/2020  07:35               245 made_up_jobs.csv
               2 File(s)            310 bytes
               2 Dir(s)  24,904,986,624 bytes free


File Not Found
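
The `!pwd` and `!ls` shell commands above fail because this notebook ran on Windows, where those programs are not available. A platform-independent alternative is to ask Python itself. The sketch below uses only the standard library; the `data` folder name is taken from the listing above and may not exist on your machine:

```python
from pathlib import Path

# Current working directory, the same information `pwd` reports
cwd = Path.cwd()
print(cwd)

# Names of the entries in the working directory, similar to `ls`
entries = sorted(p.name for p in cwd.iterdir())
print(entries)

# List the CSV files in a `data` subfolder, if one exists here
data_dir = cwd / "data"
csv_files = sorted(p.name for p in data_dir.glob("*.csv")) if data_dir.is_dir() else []
print(csv_files)
```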


In [30]:
import pandas as pd

example_csv = pd.read_csv('data/example1.csv')

There is also `read_excel`, `read_html`, and many other pandas `read_` functions.  
http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
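
`read_csv` also accepts a file-like object, which makes it easy to experiment without a file on disk. The sketch below builds a small table shaped like `example1.csv` from an in-memory string. The exact text of the real file is an assumption, inferred from the DataFrame displayed in this notebook:

```python
import io
import pandas as pd

# Assumed contents, inferred from the displayed DataFrame (not read from the real file)
csv_text = "Title1,Title2,Title3\none,two,three\nexample1,example2,example3\n"

# StringIO wraps the string so read_csv can treat it like an open file
example_from_text = pd.read_csv(io.StringIO(csv_text))

print(example_from_text.shape)
print(list(example_from_text.columns))
```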

In [32]:
example_csv  # the unlabeled leftmost column is the 0-based row index that pandas adds automatically

Unnamed: 0,Title1,Title2,Title3
0,one,two,three
1,example1,example2,example3


In [33]:
# pd.read_<file type>(...)  <- pick the reader that matches the file type and read directly from the file

In [34]:
type(example_csv)        # <- dataframe. specifically 2 dimensional

pandas.core.frame.DataFrame

In [35]:
example_csv.Title1  # <- returns a one-dimensional Series

0         one
1    example1
Name: Title1, dtype: object

In [36]:
example_csv.Title2

0         two
1    example2
Name: Title2, dtype: object

In [37]:
example_csv.Title3

0       three
1    example3
Name: Title3, dtype: object

In [38]:
type(example_csv.Title3)   # <- a Series taken from the DataFrame

pandas.core.series.Series

In [40]:
example_csv.Title1.values  # <- NumPy powers pandas: .values returns the column as a NumPy array, which allows fast processing

array(['one', 'example1'], dtype=object)

In [31]:
example_csv.describe()

Unnamed: 0,Title1,Title2,Title3
count,2,2,2
unique,2,2,2
top,one,example2,three
freq,1,1,1


In [42]:
example_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Title1  2 non-null      object
 1   Title2  2 non-null      object
 2   Title3  2 non-null      object
dtypes: object(3)
memory usage: 176.0+ bytes


Try loading in the example file in the `data` directory called `made_up_jobs.csv` using pandas.

In [43]:
# read in your csv here!
muj = pd.read_csv('data/made_up_jobs.csv')

# remember that it's nice to be able to look at your data, so let's do that here, too.
muj.head()

Unnamed: 0,ID,Name,Job,Years Employed
0,0,Bob Bobberty,Underwater Basket Weaver,13
1,1,Susan Smells,Salad Spinner,5
2,2,Alex Lastname,Productivity Manager,2
3,3,Rudy P.,Being cool,55
4,4,Rudy G.,Being compared to Rudy P,50


The columns hold strings and integers (dtypes `object` and `int64`). These are the same dtype names NumPy uses; `int64` means a 64-bit integer.

In [44]:
muj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID              6 non-null      int64 
 1   Name            6 non-null      object
 2   Job             6 non-null      object
 3   Years Employed  6 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 320.0+ bytes


In [45]:
 muj

Unnamed: 0,ID,Name,Job,Years Employed
0,0,Bob Bobberty,Underwater Basket Weaver,13
1,1,Susan Smells,Salad Spinner,5
2,2,Alex Lastname,Productivity Manager,2
3,3,Rudy P.,Being cool,55
4,4,Rudy G.,Being compared to Rudy P,50
5,5,Sir Wellington,Cheese Stacker,10


In [46]:
import numpy as np

In [49]:
random_df = pd.DataFrame(np.random.random((10000,10)))

In [50]:
random_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.333295,0.691536,0.274479,0.103752,0.089820,0.032955,0.931244,0.832531,0.045257,0.991766
1,0.700075,0.052749,0.920188,0.281648,0.364281,0.223237,0.810908,0.959597,0.146529,0.586805
2,0.518043,0.691018,0.608022,0.855374,0.031734,0.074386,0.804345,0.123866,0.582671,0.040183
3,0.890673,0.554176,0.707261,0.073859,0.461928,0.090462,0.650608,0.408803,0.621633,0.541748
4,0.066834,0.759129,0.900940,0.573888,0.174845,0.120769,0.322857,0.953788,0.968904,0.341715
...,...,...,...,...,...,...,...,...,...,...
9995,0.416512,0.027442,0.343730,0.338817,0.000622,0.589662,0.893708,0.286026,0.532878,0.405810
9996,0.142079,0.871140,0.559712,0.024109,0.209277,0.063637,0.124690,0.143434,0.634218,0.545851
9997,0.557774,0.417517,0.019464,0.449442,0.408959,0.909669,0.166489,0.278249,0.295726,0.811401
9998,0.029473,0.820559,0.140014,0.554241,0.730783,0.248389,0.087966,0.445462,0.430577,0.269229


In [51]:
random_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       10000 non-null  float64
 1   1       10000 non-null  float64
 2   2       10000 non-null  float64
 3   3       10000 non-null  float64
 4   4       10000 non-null  float64
 5   5       10000 non-null  float64
 6   6       10000 non-null  float64
 7   7       10000 non-null  float64
 8   8       10000 non-null  float64
 9   9       10000 non-null  float64
dtypes: float64(10)
memory usage: 781.4 KB


In [52]:
random_df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
9995,0.416512,0.027442,0.34373,0.338817,0.000622,0.589662,0.893708,0.286026,0.532878,0.40581
9996,0.142079,0.87114,0.559712,0.024109,0.209277,0.063637,0.12469,0.143434,0.634218,0.545851
9997,0.557774,0.417517,0.019464,0.449442,0.408959,0.909669,0.166489,0.278249,0.295726,0.811401
9998,0.029473,0.820559,0.140014,0.554241,0.730783,0.248389,0.087966,0.445462,0.430577,0.269229
9999,0.409096,0.818393,0.321592,0.743119,0.024727,0.590713,0.772749,0.183792,0.162079,0.409524


In [53]:
random_df.size #10000 x 10

100000

In [55]:
pd.DataFrame([[1,2,3],[4,5,'a']])

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,a


### 2. Utilizing and identifying Pandas objects

- What is a DataFrame object and what is a Series object? 
- How are they different from Python lists?

These are questions we will cover in this section. To start, let's start with this list of fruits.

In [56]:
fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']

print(fruits)

['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']


In [57]:
[x.upper() for x in fruits]  # list comprehensions can be applied everywhere!

['APPLE', 'ORANGE', 'WATERMELON', 'LEMON', 'MANGO']

In [59]:
fruits

['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']

Using our list of fruits, we can create a pandas object called a 'series' which is much like an array or a vector.

In [60]:
fruits_series = pd.Series(fruits)      # series!

print(fruits_series)
type(fruits_series)

0         Apple
1        Orange
2    Watermelon
3         Lemon
4         Mango
dtype: object


pandas.core.series.Series

One difference between Python **list objects** and pandas **series objects** is that you can define the index manually for a **series object**.

In [62]:
ind = ['a', 'b', 'c', 'd', 'e']    # changing the index from integers to letters!

fruits_series = pd.Series(fruits, index=ind)  # in Jupyter, Shift+Tab shows which arguments a function accepts

print(fruits_series)

a         Apple
b        Orange
c    Watermelon
d         Lemon
e         Mango
dtype: object


With a partner, create your own custom series from a list of lists.

In [75]:
list_of_lists = [['cat'], ['dog'], ['horse'], ['cow'], ['macaw']]
# a list of lists, each holding a single element

# create custom indices for your series
ind = ['a', 'b', 'c', 'd', 'e']

elements = [x[0].lower() for x in list_of_lists]
# pull the single string out of each inner list

# create the series using your list objects
# You can use either a for loop or also pd.Series
list_of_lists_series = pd.Series(elements)  # index=ind is left out here, so the default 0-4 index is used

# print your series
print(list_of_lists_series)
type(list_of_lists_series)

0      cat
1      dog
2    horse
3      cow
4    macaw
dtype: object


pandas.core.series.Series

In [65]:
#let's add fruits and make it a data frame


In [207]:
df_a = pd.DataFrame()

In [208]:
df_a['fruits'] = fruits_series

In [209]:
df_a['animals'] = list_of_lists_series

In [210]:
df_a

Unnamed: 0,fruits,animals
a,Apple,
b,Orange,
c,Watermelon,
d,Lemon,
e,Mango,


In [205]:
df['new_animals'] = list_of_lists_series

In [77]:
df  # index labels must match: rows whose labels are not found in the assigned Series are filled with NaN

Unnamed: 0,fruits,animals,new_animals
a,Apple,cat,
b,Orange,dog,
c,Watermelon,horse,
d,Lemon,cow,
e,Mango,macaw,
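
The empty columns above come from index alignment: when a Series is assigned as a column, pandas matches values by index label, not by position. A minimal sketch with made-up names (`letters_df`, `by_position`, and `by_label` are illustrative and not part of the lesson's notebook):

```python
import pandas as pd

fruits = pd.Series(['Apple', 'Orange'], index=['a', 'b'])
letters_df = pd.DataFrame({'fruits': fruits})

# This Series has the default integer index (0, 1), so no label matches 'a' or 'b'
by_position = pd.Series(['cat', 'dog'])
letters_df['mismatched'] = by_position

# Giving the Series the same labels lets pandas line the values up
by_label = pd.Series(['cat', 'dog'], index=['a', 'b'])
letters_df['aligned'] = by_label

# A plain list has no index, so it is assigned by position
letters_df['from_list'] = by_position.tolist()

print(letters_df)
```

If the labels cannot be made to match, assigning a list (or the Series' `.values`) is one way to fall back to positional assignment.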


We can do a similar thing with Python dictionaries. This time, however, we will create a DataFrame object from a Python dictionary.

In [81]:
# Dictionary with list object in values
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}

student_dict.values() # 3 lists within this dict

dict_values([['Samantha', 'Alex', 'Dante'], ['35', '17', '26'], ['Houston', 'Seattle', 'New york']])

In [162]:
student_list = [        #list of dictionaries
    {'name': 'Samantha', 'age': '35', 'city': 'Houston'},
    {'name': 'Alex', 'age': '17', 'city': 'Seattle'},
    {'name': 'Dante', 'age': '26', 'city': 'New York'}
]

pd.DataFrame(student_list)

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New York


In [84]:
students_df = pd.DataFrame(student_dict.values()) 

# when only values are brought in, keys are not reflected in column names

students_df.head()

Unnamed: 0,0,1,2
0,Samantha,Alex,Dante
1,35,17,26
2,Houston,Seattle,New york


In [85]:
students_df = pd.DataFrame(student_dict)

students_df.head()

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [128]:
students_df = pd.DataFrame(student_list)

students_df.head()

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle


In [129]:
#to find data types of columns
students_df.dtypes

name    object
age     object
city    object
dtype: object

Let's change the data type of the ages to a numeric type.

In [130]:
# We can also change a column's type, but the change has to make sense.
students_df.age = students_df.age.values

# Uncomment the line below and observe what happens when trying to convert the students' names to int or float
#students_df.name = students_df.name.astype(int)

# What happens when converting to string? The ages are already strings, so the dtype stays object.
students_df.age = students_df.age.astype(str)

students_df.dtypes

name    object
age     object
city    object
dtype: object

In [131]:
# We can also change a column's type, but the change has to make sense.
students_df.age = students_df.age.values

# Uncomment the line below and observe what happens when trying to convert the students' names to int or float
#students_df.name = students_df.name.astype(int)

# Converting the string ages to float works because every value is numeric text
students_df.age = students_df.age.astype(float)

students_df.dtypes

name     object
age     float64
city     object
dtype: object

In [132]:
pd.DataFrame(students_df)

Unnamed: 0,name,age,city
0,Samantha,35.0,Houston
1,Alex,17.0,Seattle


In [133]:
students_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    2 non-null      object 
 1   age     2 non-null      float64
 2   city    2 non-null      object 
dtypes: float64(1), object(2)
memory usage: 176.0+ bytes
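
`astype` raises an error as soon as one value cannot be converted. When a column may contain stray non-numeric entries, `pd.to_numeric` with `errors='coerce'` is a more forgiving option: unparseable values become `NaN`. The `'unknown'` entry below is invented for illustration:

```python
import pandas as pd

ages = pd.Series(['35', '17', 'unknown'])

# astype(int) raises ValueError here because of 'unknown'
try:
    ages.astype(int)
    astype_failed = False
except ValueError:
    astype_failed = True

# to_numeric with errors='coerce' turns unparseable entries into NaN
ages_numeric = pd.to_numeric(ages, errors='coerce')

print(astype_failed)
print(ages_numeric.dtype)
print(ages_numeric.isna().sum())
```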


We can also use a custom index for these items. For example, we might want them to be the individual student ID numbers.

In [134]:
school_ids = ['1111', '1145', '0096']

# Notice here we use pd.DataFrame not pd.Series as we did for a pandas series.
students_df = pd.DataFrame(student_dict, index=school_ids)

students_df.head()

Unnamed: 0,name,age,city
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [135]:
students_df.city

1111     Houston
1145     Seattle
0096    New york
Name: city, dtype: object

In [136]:
students_df.name

1111    Samantha
1145        Alex
0096       Dante
Name: name, dtype: object

Using Pandas, we can also rename column names.

In [137]:
students_df.columns

Index(['name', 'age', 'city'], dtype='object')

In [138]:
students_df.columns = [x.upper() for x in students_df.columns]  #comprehension
students_df.head()

Unnamed: 0,NAME,AGE,CITY
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [139]:
students_df.columns = [x.replace(' ','') for x in students_df.columns]  #getting rid of spaces
students_df.head()

Unnamed: 0,NAME,AGE,CITY
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [140]:
students_df.columns = ['NAME', 'AGE', 'HOME']
students_df.head()

Unnamed: 0,NAME,AGE,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


Or, we can also change the column names using the rename function.

In [141]:
students_df.rename(columns={"AGE": "YEARS"})  # the dict maps old name -> new name; returns a new DataFrame

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [147]:
# Notice what happens when we print students_df

students_df

Unnamed: 0,NAME,AGE,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [150]:
students_df.rename(columns={'AGE':'YEARS'}, inplace=True)
pd.DataFrame(students_df)

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [151]:
# If you want the DataFrame to be modified directly instead of getting a copy back, use the option `inplace=True`.
students_df.rename(columns={'AGE': 'YEARS'}, inplace=True)  # many pandas users prefer reassigning the result to using inplace
students_df.head()

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [124]:
students_df=students_df.rename(columns={"AGE": "YEARS"}) 
students_df

Unnamed: 0,NAME,YEARS,CITY
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york
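
The cells above show two ways to keep a rename: reassign the returned DataFrame, or pass `inplace=True`. A compact sketch of the difference, using a small stand-in frame (`demo_df` is a hypothetical name):

```python
import pandas as pd

demo_df = pd.DataFrame({'NAME': ['Samantha', 'Alex'], 'AGE': ['35', '17']})

# rename returns a new DataFrame; demo_df itself is unchanged
renamed = demo_df.rename(columns={'AGE': 'YEARS'})
columns_before = list(demo_df.columns)
print(columns_before)
print(list(renamed.columns))

# inplace=True modifies demo_df and returns None
result = demo_df.rename(columns={'AGE': 'YEARS'}, inplace=True)
print(result)
print(list(demo_df.columns))
```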


Know what is possible, understand the mechanism, and look up the details (Google, Stack Overflow) so you know how to apply them for your purpose. That is what you need to do.

Similarly, there is a tool to remove rows and columns from your DataFrame.

In [152]:
students_df.drop(columns=['YEARS', 'HOME'])  # removes the listed columns and returns a new DataFrame

#if you want to keep this,
skinny_df = students_df.drop(columns=['YEARS', 'HOME'])
skinny_df

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante


In [153]:
#Notice again what happens if we print students_df 
students_df

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [154]:
students_df.drop(columns=['YEARS', 'HOME'], inplace=True)
students_df

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante


In [163]:
students_df = pd.DataFrame(student_list)
students_df

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New York


In [164]:
students_df.columns = [x.capitalize() for x in students_df.columns]
students_df.head()


Unnamed: 0,Name,Age,City
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New York


Use `drop` mainly to remove columns, not rows. Rows are generally filtered.


In [165]:
my_df = pd.DataFrame([[1,2,3,],[4,None,6]])
my_df

Unnamed: 0,0,1,2
0,1,2.0,3
1,4,,6


In [166]:
my_df.dropna()       # drop any row with null values

# reminder: my_df has not been changed!

Unnamed: 0,0,1,2
0,1,2.0,3


In [168]:
my_df_updated = my_df.dropna()  
my_df_updated

Unnamed: 0,0,1,2
0,1,2.0,3


If you want the DataFrame to be modified directly, use the option `inplace=True`.

Every function has options. Let's read more about `drop` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
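
Two of those options in a short sketch: `drop` can remove rows by index label as well as columns, and `dropna` can be restricted to particular columns with `subset`. The frame below is a made-up example:

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [90, None, 75], 'art': [80, 85, None]},
    index=['ann', 'bob', 'cy'],
)

# Remove a row by its index label (returns a new DataFrame)
without_bob = scores.drop(index=['bob'])

# Remove a column by name
math_only = scores.drop(columns=['art'])

# Drop only the rows that are missing a value in the 'math' column
has_math = scores.dropna(subset=['math'])

print(without_bob.index.tolist())
print(math_only.columns.tolist())
print(has_math.index.tolist())
```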

### 3. Filtering Data Using Pandas
There are several ways to grab particular data from a DataFrame. 
- Python lists allow for selection of data only through integer location. 
- You can use a single integer or slice notation to make the selection but NOT a list of integers.
- Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [170]:
l = [1, 2, 3, 4, 5]
l[[0, 3]]   # a list cannot select several separate elements at once; it accepts only a single integer or a slice

TypeError: list indices must be integers or slices, not list
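
To pick several non-adjacent elements from a plain list, or several keys from a dict, you need a comprehension or a helper such as `operator.itemgetter`. A short sketch (the `ages` dict is invented for illustration):

```python
from operator import itemgetter

l = [1, 2, 3, 4, 5]
ages = {'Samantha': 35, 'Alex': 17, 'Dante': 26}

# A comprehension selects arbitrary positions from a list
picked = [l[i] for i in [0, 3]]

# itemgetter does the same and returns a tuple
picked_tuple = itemgetter(0, 3)(l)

# The same idea works for several dictionary keys
some_ages = {name: ages[name] for name in ['Samantha', 'Dante']}

print(picked, picked_tuple, some_ages)
```

Pandas builds this kind of multi-element selection into its indexers, which is what the rest of this section covers.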

### DataFrames can be indexed by column name (label) or row name (index) or by position.   
#### The `.loc` indexer is used for indexing by name.  
#### The `.iloc` indexer is used for indexing by number.

In [180]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}

students_df = pd.DataFrame(student_dict)

In [173]:
students_df.loc[:, 'name']  #rows, columns

0    Samantha
1        Alex
2       Dante
Name: name, dtype: object

In [175]:
students_df.loc[:,:]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [176]:
students_df.loc[:,'city']

0     Houston
1     Seattle
2    New york
Name: city, dtype: object

In [184]:
students_df.loc[:,'age':'city']

Unnamed: 0,age,city
0,35,Houston
1,17,Seattle
2,26,New york


In [185]:
students_df.loc[0:1,:]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle


### Let's take a look at `.iloc`
#### `.iloc` takes slices based on index position.
#### `.iloc` stands for integer location, which should help with remembering what it does.
#### `.iloc[row, column]`

In [224]:
# returns the first row
students_df.iloc[0]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

In [225]:
# returns the first column
students_df.iloc[:, 0]

0    Samantha
1        Alex
2       Dante
Name: name, dtype: object

In [226]:
# returns the first two rows; notice that iloc performs regular Python slicing (the end position is excluded).
students_df.iloc[0:2]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle


In [227]:
# returns the first two columns
students_df.iloc[:, 0:2]

Unnamed: 0,name,age
0,Samantha,35
1,Alex,17
2,Dante,26


In [228]:
# returns the first row and the first two columns (positions 0 and 1)
students_df.iloc[0:1, 0:2]

Unnamed: 0,name,age
0,Samantha,35


### How would we use `.iloc` to return the last item in the last row?


In [229]:
# return the last item in the last row using iloc
students_df.iloc[-1,-1]

'New york'

### How would we use `.iloc` to return the last item in the last column?


In [230]:
# return the last item in the last column using iloc
students_df.iloc[-1,-1]

'New york'

### What if we only want certain columns or rows?

In [231]:
# Note: students_df.iloc[0, 2] would return the single cell at row 0, column 2; pass a list to select rows 0 and 2
students_df.iloc[[0, 2]]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
2,Dante,26,New york


In [232]:
students_df.iloc[[0, 2], [0, 2]]

Unnamed: 0,name,city
0,Samantha,Houston
2,Dante,New york


### Let's take a look at `.loc`
#### Label based method. 
#### Names or labels of the index are used when taking slices.
#### Also supports boolean subsetting.

In [211]:
# We will use loc to return rows and columns based on labels. Let's look at the students_df DataFrame again.
students_df.iloc[0]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

In [214]:
students_df.loc[0:2] # inclusive of the 2: label slices do not follow the usual end-exclusive slicing convention

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [233]:
students_df.iloc[0:2] # exclusive of the 2: regular end-exclusive Python slicing

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle


In [234]:
# returns the student information associated with index 0
students_df.loc[0]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

In [235]:
students_df.iloc[0:1, 0:2]

Unnamed: 0,name,age
0,Samantha,35


In [236]:
# returns the student information for row index 0 to 2 inclusive.
# note that iloc would use normal Python slicing, which does not include 2, as demonstrated above.
students_df.loc[0:2]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [220]:
students_df.iloc[-1,-1] #last item in the last column

'New york'

In [237]:
# returns the column labeled 'age'
students_df.loc[:, 'age']

0    35
1    17
2    26
Name: age, dtype: object

In [238]:
# returns the column labeled 'age' for index labels 1 to 2:
# the values of the rows with index from 1 to 2 (inclusive)
# in the column labeled 'age'
students_df.loc[1:2, 'age']

1    17
2    26
Name: age, dtype: object

In [239]:
# returns the columns labeled 'age' to 'city' for index labels 1 to 2:
# the values of the rows with index from 1 to 2 (inclusive)
# in the columns labeled 'age' to 'city' (inclusive)
students_df.loc[1:2, 'age':'city']

Unnamed: 0,age,city
1,17,Seattle
2,26,New york


In [240]:
# What should we get?
students_df.loc[1:2, ['name', 'city']]

Unnamed: 0,name,city
1,Alex,Seattle
2,Dante,New york


In [241]:
# How about?
students_df.loc[[0, 2], ['name', 'city']]

Unnamed: 0,name,city
0,Samantha,Houston
2,Dante,New york


In [242]:
# if index rearranged
school_ids = ['5', '11', '3']
students_df = pd.DataFrame(student_dict, index=school_ids)

In [243]:
students_df   #now index has been changed

Unnamed: 0,name,age,city
5,Samantha,35,Houston
11,Alex,17,Seattle
3,Dante,26,New york


In [244]:
# What should we get now?
students_df.loc[[0, 2], ['name', 'city']]  # raises KeyError: the rows are no longer labeled 0 and 2

KeyError: "None of [Int64Index([0, 2], dtype='int64')] are in the [index]"

In [245]:
# What should we get now?
students_df.loc[['5', '11'], ['name', 'city']]

Unnamed: 0,name,city
5,Samantha,Houston
11,Alex,Seattle
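
With a non-default index the difference between the two indexers becomes visible: `.loc` looks up the labels `'5'`, `'11'`, `'3'`, while `.iloc` still counts positions from 0. A sketch that rebuilds the same frame under the hypothetical name `students`:

```python
import pandas as pd

student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york'],
}
students = pd.DataFrame(student_dict, index=['5', '11', '3'])

# Label-based: the row whose index label is the string '3'
by_label = students.loc['3', 'name']

# Position-based: the row at position 0, whatever its label is
by_position = students.iloc[0, 0]

# Label slices include the end label; position slices exclude the end position
label_slice = students.loc['5':'11']
position_slice = students.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```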


In [254]:
students_df = students_df.reset_index()
students_df 

Unnamed: 0,index,name,age,city
0,0,Samantha,35,Houston
1,1,Alex,17,Seattle
2,2,Dante,26,New york


In [256]:
students_df.reset_index().set_index('name')  # chain commands

Unnamed: 0_level_0,level_0,index,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Samantha,0,0,35,Houston
Alex,1,1,17,Seattle
Dante,2,2,26,New york


In [260]:
students_df.reset_index()

Unnamed: 0,level_0,index,name,age,city
0,0,0,Samantha,35,Houston
1,1,1,Alex,17,Seattle
2,2,2,Dante,26,New york
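
Each call to `reset_index` moves the current index into a new column, which is why `index` and `level_0` pile up above. Passing `drop=True` discards the old index instead. A sketch on a small stand-in frame (`students` is a hypothetical name):

```python
import pandas as pd

students = pd.DataFrame(
    {'name': ['Samantha', 'Alex'], 'age': ['35', '17']},
    index=['5', '11'],
)

# Default: the old index is kept as a column called 'index'
kept = students.reset_index()

# drop=True: the old index is thrown away and a fresh 0..n-1 index is used
dropped = students.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
print(dropped.index.tolist())
```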


In [271]:
students_df

Unnamed: 0,name,age,city,state
0,Samantha,35,Houston,Texas
1,Alex,17,Seattle,Washington
2,Dante,26,New york,New York
3,Samantha,21,Atlanta,Georgia


In [246]:
students_df.set_index("name", inplace=True)  # setting the name to be the index

students_df

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Samantha,35,Houston
Alex,17,Seattle
Dante,26,New york


In [249]:
students_df

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Samantha,35,Houston
Alex,17,Seattle
Dante,26,New york


In [247]:
students_df.loc[['Samantha']]

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Samantha,35,Houston


In [None]:
# students_df.set_index()  # incomplete: set_index needs the column(s) to use as the index, e.g. set_index('name')

In [248]:
# Subsetting nonconsecutive rows
students_df.loc[['Samantha', 'Dante']]

Unnamed: 0_level_0,age,city
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Samantha,35,Houston
Dante,26,New york


In [None]:
# Samantha to the end
students_df.loc['Samantha':]

In [None]:
# return the first and last rows using one loc command
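
One possible answer to the exercise above, sketched on a stand-in frame indexed by name: pass `.loc` a list holding the first and last index labels. This assumes the index labels are unique; with duplicate labels, `.loc` would return every matching row.

```python
import pandas as pd

students = pd.DataFrame(
    {'age': ['35', '17', '26'], 'city': ['Houston', 'Seattle', 'New york']},
    index=pd.Index(['Samantha', 'Alex', 'Dante'], name='name'),
)

# Look up the first and last labels, then select both rows in a single .loc call
first_and_last = students.loc[[students.index[0], students.index[-1]]]
print(first_and_last)
```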

### Boolean Subsetting

In [263]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
    'state': ['Texas', 'Washington', 'New York', 'Georgia']
}

students_df = pd.DataFrame(student_dict)

In [264]:
# The expression students_df['name'] == 'Samantha' produces a pandas Series with a True/False value for every row
# in students_df, with True for the rows where the name is 'Samantha'.
# This type of boolean Series can be passed directly to the .loc indexer.
students_df.loc[students_df['name'] == 'Samantha']

Unnamed: 0,name,age,city,state
0,Samantha,35,Houston,Texas
3,Samantha,21,Atlanta,Georgia


In [265]:
# What about if we only want the city and state of the selected students with the name Samantha?
students_df.loc[students_df['name'] == 'Samantha', ['city', 'state']]

Unnamed: 0,city,state
0,Houston,Texas
3,Atlanta,Georgia


In [266]:
# What about if we want to select a student of a specific age?
students_df.loc[students_df['age'] == '21']

Unnamed: 0,name,age,city,state
3,Samantha,21,Atlanta,Georgia


In [267]:
# What about if we want to select a student of a specific age who lives in a specific city?
students_df.loc[(students_df['age'] == '21') &  # & is element-wise AND (intersection), not the same as Python's `and`
                (students_df['city'] == 'Atlanta')]

Unnamed: 0,name,age,city,state
3,Samantha,21,Atlanta,Georgia


In [268]:
# What should be returned?
students_df.loc[(students_df['age'] == '35') &
                (students_df['city'] == 'Atlanta')]

Unnamed: 0,name,age,city,state
