# Slicing, Filtering and Merging Data

**[Slicing Data](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#slicing-data)**
* [Selecting columns by label (column name)](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#selecting-columns-by-label-column-name)
* [Selecting columns by position](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#selecting-columns-by-position)
* [Selecting a customized list of columns](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#selecting-a-customized-list-of-columns)
* [Selecting rows by position](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#selecting-rows-by-position)
* [Selecting rows by boolean conditions](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#selecting-rows-by-boolean-conditions)
* [Slicing on Multiple Parameters](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#slicing-on-multiple-parameters)

**[Filtering Data](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#filtering-data)**
* [Selecting rows whose column values is not NA/ NaN](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#selecting-rows-whose-column-values-is-not-na-nan)
* [Selecting rows where column value is one of the following: "x", "y" or "z" (a list)](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#selecting-rows-where-column-value-is-one-of-the-following-x-y-or-z-a-list)

**[Merging Dataframes (Dataframe concatenation and SQL joins)](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#merging-dataframes-dataframe-concatenation-and-sql-joins)**
* [Dataframe concatenation](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#dataframe-concatenation)
* [Join two dataframes along rows](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#join-two-dataframes-along-rows)
* [Join two dataframes along columns](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#join-two-dataframes-along-columns)
* [Dataframe Merging (SQL Joins)](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#dataframe-merging-sql-joins)
* [Merge with Inner Join](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#merge-with-inner-join)
* [Merge with Right Join](https://github.com/kn-kn/python-guide/wiki/Slicing%2C-Filtering%2C-and-Merging#merge-with-right-join)


**The following tables will be used for the majority of the examples below:**

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('C:/Users/kenguyen/Downloads/DemographicData.csv')

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])

raw_data = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])

In [2]:
df.head(5)

|  | Country Name | Country Code | Birth rate | Internet users | Income Group |
|---|---|---|---|---|---|
| 0 | Aruba | ABW | 10.244 | 78.9 | High income |
| 1 | Afghanistan | AFG | 35.253 | 5.9 | Low income |
| 2 | Angola | AGO | 45.985 | 19.1 | Upper middle income |
| 3 | Albania | ALB | 12.877 | 57.2 | Upper middle income |
| 4 | United Arab Emirates | ARE | 11.044 | 88.0 | High income |


In [3]:
df_a

|  | subject_id | first_name | last_name |
|---|---|---|---|
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 3 | 4 | Alice | Aoni |
| 4 | 5 | Ayoung | Atiches |


In [4]:
df_b

|  | subject_id | first_name | last_name |
|---|---|---|---|
| 0 | 4 | Billy | Bonder |
| 1 | 5 | Brian | Black |
| 2 | 6 | Bran | Balwner |
| 3 | 7 | Bryce | Brice |
| 4 | 8 | Betty | Btisan |


## Slicing Data

### Slicing Columns

#### Select Columns by Label (column name)

In [5]:
df[['Country Name', 'Internet users', 'Income Group']].head(5)

|  | Country Name | Internet users | Income Group |
|---|---|---|---|
| 0 | Aruba | 78.9 | High income |
| 1 | Afghanistan | 5.9 | Low income |
| 2 | Angola | 19.1 | Upper middle income |
| 3 | Albania | 57.2 | Upper middle income |
| 4 | United Arab Emirates | 88.0 | High income |


#### Selecting Columns by Position

The `iloc` indexer selects rows and columns of a pandas DataFrame by integer position, counting from 0.

In [6]:
# A single integer position returns a Series
df.iloc[:,2].head(5)

0    10.244
1    35.253
2    45.985
3    12.877
4    11.044
Name: Birth rate, dtype: float64
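
The return type of `iloc` depends on how the position is written. The sketch below uses a small stand-in DataFrame (the variable names `as_series`, `as_frame`, and `by_slice` are illustrative, not part of the guide) built from the first rows shown earlier:

```python
import pandas as pd

# Stand-in for the demographic table, using values from the rows shown above
df = pd.DataFrame({
    'Country Name': ['Aruba', 'Afghanistan', 'Angola', 'Albania'],
    'Country Code': ['ABW', 'AFG', 'AGO', 'ALB'],
    'Birth rate': [10.244, 35.253, 45.985, 12.877],
    'Internet users': [78.9, 5.9, 19.1, 57.2],
})

as_series = df.iloc[:, 2]    # single integer: returns a Series
as_frame = df.iloc[:, [2]]   # list with one integer: returns a one-column DataFrame
by_slice = df.iloc[:, 1:3]   # slice: end position is excluded, so columns 1 and 2

print(type(as_series).__name__, type(as_frame).__name__)
print(list(by_slice.columns))
```

Wrapping the position in a list is a convenient way to keep a DataFrame when later steps expect one.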

#### Selecting a Customized List of Columns

In [7]:
df.iloc[:, [0,3,4]].head(5)

|  | Country Name | Internet users | Income Group |
|---|---|---|---|
| 0 | Aruba | 78.9 | High income |
| 1 | Afghanistan | 5.9 | Low income |
| 2 | Angola | 19.1 | Upper middle income |
| 3 | Albania | 57.2 | Upper middle income |
| 4 | United Arab Emirates | 88.0 | High income |


### Slicing Rows

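The `loc` indexer can be used to either:
* Select rows by index label. With the default integer index the labels match the row positions, but a `loc` slice includes both endpoints, so `df.loc[55:57]` returns three rows.
* Select rows with a boolean (conditional) mask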
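
The difference between label-based and position-based selection is easier to see when the index does not start at 0. A minimal sketch, assuming a hypothetical `scores` DataFrame that is not part of the guide's data:

```python
import pandas as pd

# Hypothetical frame whose index labels (100..103) differ from positions (0..3)
scores = pd.DataFrame({'score': [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = scores.loc[101:102]           # labels 101 and 102; end label is included
by_position = scores.iloc[1:2]           # position 1 only; end position is excluded
high = scores.loc[scores['score'] > 25]  # boolean mask keeps rows where it is True

print(list(by_label.index))
print(list(by_position.index))
print(list(high.index))
```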

#### Selecting rows by position

In [8]:
df.loc[55:57]

|  | Country Name | Country Code | Birth rate | Internet users | Income Group |
|---|---|---|---|---|---|
| 55 | Ethiopia | ETH | 32.925 | 1.9 | Low income |
| 56 | Finland | FIN | 10.7 | 91.5144 | High income |
| 57 | Fiji | FJI | 20.463 | 37.1 | Upper middle income |


#### Selecting rows by boolean conditions

In [9]:
df.loc[df['Country Name'] == 'Canada']

|  | Country Name | Country Code | Birth rate | Internet users | Income Group |
|---|---|---|---|---|---|
| 30 | Canada | CAN | 10.9 | 85.8 | High income |


In [10]:
df.loc[df['Birth rate'] > 45]

|  | Country Name | Country Code | Birth rate | Internet users | Income Group |
|---|---|---|---|---|---|
| 2 | Angola | AGO | 45.985 | 19.1 | Upper middle income |
| 127 | Niger | NER | 49.661 | 1.7 | Low income |
| 167 | Chad | TCD | 45.745 | 2.3 | Low income |


### Slicing on Multiple Parameters

In [11]:
df[(df['Birth rate'] > 35) & (df['Internet users'] > 19)]

|  | Country Name | Country Code | Birth rate | Internet users | Income Group |
|---|---|---|---|---|---|
| 2 | Angola | AGO | 45.985 | 19.1 | Upper middle income |
| 91 | Kenya | KEN | 35.194 | 39.0 | Lower middle income |
| 128 | Nigeria | NGA | 40.045 | 38.0 | Lower middle income |


In [12]:
# Selects the rows labeled 15, 16 and 17 (loc slices include both endpoints) and the 2 columns specified
df.loc[15:17, ['Country Name', 'Country Code']]

|  | Country Name | Country Code |
|---|---|---|
| 15 | Bangladesh | BGD |
| 16 | Bulgaria | BGR |
| 17 | Bahrain | BHR |
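
Conditions can also be combined with `|` (or) and negated with `~`, and a boolean mask can be paired with a column list inside `loc`. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==` in Python. A short sketch on a stand-in DataFrame (the result variable names are illustrative):

```python
import pandas as pd

# Stand-in for the demographic table, using values from the rows shown above
df = pd.DataFrame({
    'Country Name': ['Aruba', 'Afghanistan', 'Angola', 'Albania'],
    'Birth rate': [10.244, 35.253, 45.985, 12.877],
    'Internet users': [78.9, 5.9, 19.1, 57.2],
    'Income Group': ['High income', 'Low income',
                     'Upper middle income', 'Upper middle income'],
})

# OR: either condition may hold
low_or_connected = df[(df['Birth rate'] < 11) | (df['Internet users'] > 50)]

# NOT: invert a mask with ~
not_high_income = df[~(df['Income Group'] == 'High income')]

# Boolean mask for the rows plus a list of column labels
subset = df.loc[df['Birth rate'] > 35, ['Country Name', 'Birth rate']]

print(list(low_or_connected['Country Name']))
print(list(not_high_income['Country Name']))
print(subset)
```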


## Filtering Data

### Selecting rows whose column values is not NA/ NaN

In [13]:
# Keep only the rows whose 'Birth rate' value is not NA/NaN
df[df['Birth rate'].notnull()].head(5)

|  | Country Name | Country Code | Birth rate | Internet users | Income Group |
|---|---|---|---|---|---|
| 0 | Aruba | ABW | 10.244 | 78.9 | High income |
| 1 | Afghanistan | AFG | 35.253 | 5.9 | Low income |
| 2 | Angola | AGO | 45.985 | 19.1 | Upper middle income |
| 3 | Albania | ALB | 12.877 | 57.2 | Upper middle income |
| 4 | United Arab Emirates | ARE | 11.044 | 88.0 | High income |
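
The sample rows above contain no missing values, so the effect of the filter is not visible there. The sketch below uses a hypothetical `df_missing` DataFrame (not part of the guide's data) to compare three approaches: a single-column `notnull()` mask, `dropna()`, and a whole-frame `notnull()` mask, which keeps every row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each numeric column
df_missing = pd.DataFrame({
    'Country Name': ['A', 'B', 'C'],
    'Birth rate': [10.2, np.nan, 45.9],
    'Internet users': [78.9, 5.9, np.nan],
})

# Rows where one specific column is present
has_birth_rate = df_missing[df_missing['Birth rate'].notnull()]

# Rows where every column is present
complete_rows = df_missing.dropna()

# A whole-frame mask does not drop rows; the result has the same shape
masked = df_missing[df_missing.notnull()]

print(list(has_birth_rate['Country Name']))
print(list(complete_rows['Country Name']))
print(masked.shape)
```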


### Selecting rows where column value is one of the following: "x", "y" or "z" (a list)

In [14]:
country_list = ['Canada', 'Japan', 'South Africa', 'Germany']
df[df['Country Name'].isin(country_list)]

|  | Country Name | Country Code | Birth rate | Internet users | Income Group |
|---|---|---|---|---|---|
| 30 | Canada | CAN | 10.9 | 85.8 | High income |
| 45 | Germany | DEU | 8.5 | 84.17 | High income |
| 89 | Japan | JPN | 8.2 | 89.71 | High income |
| 191 | South Africa | ZAF | 20.85 | 46.5 | Upper middle income |
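
To select the rows whose value is not in the list, the same `isin` mask can be inverted with `~`. A brief sketch on a stand-in DataFrame (the list contents and result names are illustrative):

```python
import pandas as pd

# Stand-in for the demographic table, using values from the rows shown above
df = pd.DataFrame({
    'Country Name': ['Aruba', 'Afghanistan', 'Angola', 'Albania'],
    'Country Code': ['ABW', 'AFG', 'AGO', 'ALB'],
})

country_list = ['Aruba', 'Angola']

in_list = df[df['Country Name'].isin(country_list)]       # rows in the list
not_in_list = df[~df['Country Name'].isin(country_list)]  # every other row

print(list(in_list['Country Code']))
print(list(not_in_list['Country Code']))
```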


## Merging Dataframes (Dataframe concatenation and SQL joins)

### Dataframe concatenation

#### Join two dataframes along rows

In [15]:
pd.concat([df_a, df_b])

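|  | subject_id | first_name | last_name |
|---|---|---|---|
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 3 | 4 | Alice | Aoni |
| 4 | 5 | Ayoung | Atiches |
| 0 | 4 | Billy | Bonder |
| 1 | 5 | Brian | Black |
| 2 | 6 | Bran | Balwner |
| 3 | 7 | Bryce | Brice |
| 4 | 8 | Betty | Btisan |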
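
Note that the result keeps each frame's original index, so the labels 0 to 4 appear twice. If a fresh, unique index is preferred, `ignore_index=True` renumbers the rows. A minimal sketch with shortened versions of `df_a` and `df_b`:

```python
import pandas as pd

# Shortened versions of df_a and df_b from the setup cell
df_a = pd.DataFrame({'subject_id': ['1', '2'], 'first_name': ['Alex', 'Amy']})
df_b = pd.DataFrame({'subject_id': ['4', '5'], 'first_name': ['Billy', 'Brian']})

stacked = pd.concat([df_a, df_b])                        # index labels repeat
renumbered = pd.concat([df_a, df_b], ignore_index=True)  # index is rebuilt from 0

print(list(stacked.index))
print(list(renumbered.index))
```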


#### Join two dataframes along columns

In [16]:
pd.concat([df_a, df_b], axis = 1)

|  | subject_id | first_name | last_name | subject_id | first_name | last_name |
|---|---|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
| 1 | 2 | Amy | Ackerman | 5 | Brian | Black |
| 2 | 3 | Allen | Ali | 6 | Bran | Balwner |
| 3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
| 4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |


### Dataframe Merging (SQL Joins)

The pandas `merge()` function works much like a SQL join.

[`merge()` Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html)

The function takes the following parameters:
`merge(left_df, right_df, how='method', left_on='column_name', right_on='column_name')`

Where:

* **`left_df`**: Your left dataframe
* **`right_df`**: Your right dataframe
* **`how`**: The type of join you want to do (`'left'`, `'right'`, `'outer'` or `'inner'`; the default is `'inner'`)
* **`left_on`**: The column on your left dataframe to join on
* **`right_on`**: The column on your right dataframe to join on
* OPTIONAL: If the columns you are joining on have the same name, you can use `on` by itself instead of `left_on` and `right_on`
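
When the key columns have different names, `left_on` and `right_on` are needed. The sketch below assumes two hypothetical DataFrames, `subjects` and `tests`, with a made-up `participant` key column; they are not part of the guide's data.

```python
import pandas as pd

# Hypothetical frames whose join keys have different column names
subjects = pd.DataFrame({
    'subject_id': ['1', '2', '3'],
    'first_name': ['Alex', 'Amy', 'Allen'],
})
tests = pd.DataFrame({
    'participant': ['2', '3', '4'],
    'test_score': [51, 15, 61],
})

# Match subjects.subject_id against tests.participant
merged = pd.merge(subjects, tests, how='inner',
                  left_on='subject_id', right_on='participant')

print(merged)
```

Both key columns are kept in the result when their names differ; one of them can be dropped afterwards if it is redundant.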

#### Merge with Inner Join

In [17]:
pd.merge(df_a, df_b, on='subject_id', how='inner')

|  | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
|---|---|---|---|---|---|
| 0 | 4 | Alice | Aoni | Billy | Bonder |
| 1 | 5 | Ayoung | Atiches | Brian | Black |


#### Merge with Right Join

In [18]:
pd.merge(df_a, df_b, on='subject_id', how='right')

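|  | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
|---|---|---|---|---|---|
| 0 | 4 | Alice | Aoni | Billy | Bonder |
| 1 | 5 | Ayoung | Atiches | Brian | Black |
| 2 | 6 | NaN | NaN | Bran | Balwner |
| 3 | 7 | NaN | NaN | Bryce | Brice |
| 4 | 8 | NaN | NaN | Betty | Btisan |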
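
The remaining `how` options follow the same pattern. As a sketch, the block below runs a left join and an outer join on trimmed copies of `df_a` and `df_b` (first names only); passing `indicator=True` adds a `_merge` column that records which side each row came from.

```python
import pandas as pd

# Trimmed copies of df_a and df_b from the setup cell
df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
})

# Left join: every row of df_a, with NaN where df_b has no match
left = pd.merge(df_a, df_b, on='subject_id', how='left')

# Outer join: every key from either frame; _merge is 'left_only', 'right_only' or 'both'
outer = pd.merge(df_a, df_b, on='subject_id', how='outer', indicator=True)

print(left)
print(outer)
```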
