# Reference guide: Pandas tools for structuring a dataset

As you’ve learned, there are far too many Python functions to memorize. That’s why data professionals use reference sheets and coding libraries nearly every day in their data analysis work.

The following reference guide will help you identify the most common Pandas tools used for structuring data. Note that this is just for reference. For detailed information on how each method works, including explanations of every parameter and examples, refer to the linked documentation.

## Combine data

Many situations that require combining data can be handled with more than one function, method, or approach, so you’re usually not limited to a single “correct” choice. If these functions and methods seem very similar, that’s because they are. The best way to learn them and find out what works best for you is to use them.

[**df.merge()**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html "Link to pandas merge method documentation")

- A method available to the DataFrame class.
    
- Use df.merge() to combine the dataframe you’re applying the method to with another dataframe by matching values in shared columns or indices, similar to a database join.
    
- Example:
    

In [None]:
df1.merge(df2, how='inner', on=['month', 'year'])
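
To make the call above concrete, here is a minimal sketch. The two dataframes and their `sales` and `returns` columns are hypothetical sample data; only the `month` and `year` key columns come from the example.

```python
import pandas as pd

# Hypothetical sample data: two tables that share 'month' and 'year' columns.
df1 = pd.DataFrame({
    "month": [1, 2, 3],
    "year": [2023, 2023, 2023],
    "sales": [100, 150, 120],
})
df2 = pd.DataFrame({
    "month": [2, 3, 4],
    "year": [2023, 2023, 2023],
    "returns": [5, 8, 3],
})

# An inner merge keeps only the (month, year) pairs present in both dataframes.
merged = df1.merge(df2, how="inner", on=["month", "year"])
print(merged)
```

Only months 2 and 3 appear in both tables, so the result has two rows. Changing `how` to `'left'`, `'right'`, or `'outer'` changes which unmatched rows are kept.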

[**pd.concat()**](https://pandas.pydata.org/docs/reference/api/pandas.concat.html "Link to pandas concat function documentation")

- A pandas function to combine series and/or dataframes.
    
- Use pd.concat() to join series or dataframes along a particular axis: rows (axis=0, the default) or columns (axis=1).
    
- Example:
    

In [None]:
df3 = pd.concat([df1.drop(['column_1','column_2'], axis=1), df2])
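
The following is a small sketch of both axes, under assumed sample data. The dataframes and the `id`, `score`, and `grade` columns are hypothetical; `column_1` echoes the example above.

```python
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame({"id": [1, 2], "score": [90, 85], "column_1": ["a", "b"]})
df2 = pd.DataFrame({"id": [3, 4], "score": [70, 95]})
extra = pd.DataFrame({"grade": ["A", "B"]})

# Drop the column that df2 lacks, then stack the rows (axis=0 is the default).
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1.
stacked = pd.concat([df1.drop(["column_1"], axis=1), df2], ignore_index=True)
print(stacked)

# axis=1 places the dataframes side by side, aligning rows on the index.
side_by_side = pd.concat([df1, extra], axis=1)
print(side_by_side)
```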

[**df.join()**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html "Link to pandas join method documentation")

- A method available to the DataFrame class.
    
- Use df.join() to combine columns with another dataframe, either on the index or on a key column. To join multiple DataFrame objects by index at once, pass a list.
    
- Example:
    

In [None]:
df1.set_index('key').join(df2.set_index('key'))
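
As a rough illustration of the two ways to join described above, the sketch below uses hypothetical dataframes with a shared `key` column. The column names `A` and `B` and all values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data with a shared 'key' column.
df1 = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
df2 = pd.DataFrame({"key": ["K0", "K1", "K3"], "B": ["B0", "B1", "B3"]})

# Join index to index. The default is a left join, so every row of df1 is kept
# and 'B' is missing (NaN) for K2, which has no match in df2.
joined = df1.set_index("key").join(df2.set_index("key"))
print(joined)

# Join a key column of the caller to the index of the other dataframe.
joined_on = df1.join(df2.set_index("key"), on="key")
print(joined_on)
```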

Visual representation of a combination:

![image.png](attachment:e9d7be42-6e9e-4e3e-957c-f74284400b71.png)

## Extract or select data

`df[[columns]]`

- Use df[[columns]] to extract/select columns from a dataframe. Example:
    

In [None]:
print(df)

print()

df[['animal', 'legs']]

     animal     class  color  legs
0  cardinal      Aves    red     2
1     gecko  Reptilia  green     4
2     raven      Aves  black     2

     animal  legs
0  cardinal     2
1     gecko     4
2     raven     2
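
The example above assumes `df` already exists. A self-contained sketch, with the dataframe reconstructed from the printed output (the construction itself is an assumption), might look like this:

```python
import pandas as pd

# Reconstruction of the animal dataframe shown in the output above.
df = pd.DataFrame({
    "animal": ["cardinal", "gecko", "raven"],
    "class": ["Aves", "Reptilia", "Aves"],
    "color": ["red", "green", "black"],
    "legs": [2, 4, 2],
})

# Double brackets (a list of column names) return a DataFrame.
subset = df[["animal", "legs"]]
print(subset)

# Single brackets with one column name return a Series instead.
legs = df["legs"]
print(type(legs))
```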

[**df.select_dtypes()**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html)

- A method available to the DataFrame class.
    
- Use df.select_dtypes() to return a subset of the dataframe’s columns based on the column dtypes (e.g., float64, int64, bool, object, etc.). Example:
    

In [None]:
print(df)

print()

df2 = df.select_dtypes(include=['int64'])

df2

     animal     class  color  legs
0  cardinal      Aves    red     2
1     gecko  Reptilia  green     4
2     raven      Aves  black     2

   legs
0     2
1     4
2     2
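
Besides `include`, the method also accepts `exclude`. The sketch below uses the generic `'number'` selector rather than a specific dtype; the dataframe is a reconstruction of the one printed above and is an assumption.

```python
import pandas as pd

# Reconstruction of the animal dataframe shown in the output above.
df = pd.DataFrame({
    "animal": ["cardinal", "gecko", "raven"],
    "class": ["Aves", "Reptilia", "Aves"],
    "color": ["red", "green", "black"],
    "legs": [2, 4, 2],
})

# 'number' matches every numeric dtype (integers and floats).
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())

# exclude returns the complement: every column that is not numeric.
non_numeric = df.select_dtypes(exclude="number")
print(non_numeric.columns.tolist())
```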

Visual representation of extraction:

![image.png](attachment:df3383a4-abab-485a-ae94-3e7dda5cb45d.png)

## Filter data

Recall from Course 2 that Boolean masks are used to filter dataframes.

`df[condition]`

- Use df[condition] to filter a dataframe: the condition produces a Boolean mask, and applying the mask keeps only the rows where the condition is true.
    
- Example:
    

In [None]:
print(df)

print()

df[df['class']=='Aves']

     animal     class  color  legs
0  cardinal      Aves    red     2
1     gecko  Reptilia  green     4
2     raven      Aves  black     2

     animal class  color  legs
0  cardinal  Aves    red     2
2     raven  Aves  black     2
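
Masks can also be combined. The sketch below shows a few common patterns on a reconstruction of the dataframe above (the construction is an assumption). Note that each condition needs its own parentheses when combined with `&` or `|`.

```python
import pandas as pd

# Reconstruction of the animal dataframe shown in the output above.
df = pd.DataFrame({
    "animal": ["cardinal", "gecko", "raven"],
    "class": ["Aves", "Reptilia", "Aves"],
    "color": ["red", "green", "black"],
    "legs": [2, 4, 2],
})

# A mask is a Boolean Series with one True/False value per row.
mask = (df["class"] == "Aves") & (df["legs"] == 2)  # & means "and"
birds = df[mask]
print(birds)

# isin() tests membership in a list of values.
red_or_green = df[df["color"].isin(["red", "green"])]
print(red_or_green)

# ~ inverts a mask ("not").
not_birds = df[~(df["class"] == "Aves")]
print(not_birds)
```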

Visual representation of filtering:

![image.png](attachment:dc9c6d5d-08ee-497b-8df0-d9f80168e2e4.png)

## Sort data

[**df.sort_values()**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html "Link to sort_values function pandas documentation")

- A method available to the DataFrame class.
    
- Use df.sort_values() to sort rows by the values in one or more columns, in ascending or descending order.
    
- Example:
    

In [None]:
print(df)

print()

df.sort_values(by=['legs'], ascending=False)

     animal     class  color  legs
0  cardinal      Aves    red     2
1     gecko  Reptilia  green     4
2     raven      Aves  black     2

     animal     class  color  legs
1     gecko  Reptilia  green     4
0  cardinal      Aves    red     2
2     raven      Aves  black     2
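
Sorting by more than one column works the same way. In this sketch, which again assumes a reconstruction of the dataframe above, ties in `legs` are broken by `animal`, and each column gets its own sort direction.

```python
import pandas as pd

# Reconstruction of the animal dataframe shown in the output above.
df = pd.DataFrame({
    "animal": ["cardinal", "gecko", "raven"],
    "class": ["Aves", "Reptilia", "Aves"],
    "color": ["red", "green", "black"],
    "legs": [2, 4, 2],
})

# Sort by 'legs' ascending, then by 'animal' descending within equal 'legs'.
sorted_df = df.sort_values(by=["legs", "animal"], ascending=[True, False])
print(sorted_df)

# sort_values() returns a new dataframe and keeps the original index labels;
# reset_index(drop=True) renumbers the rows if that is preferred.
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```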

Visual representation of sorting:

![A pair of 3-row columns is shown to be sorted alphabetically and numerically.](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/uKutiLU4QFmir2PBVsYmdw_470e1433100e4a28ad3f50742ddc15f1_image1.png?expiry=1714003200000&hmac=F6rIQ-YMsAMapBHZNR-3C-mtBG1jzxaVluGVsIiIIiM)

## Slice data 

[**df.iloc[]**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html "Link to iloc pandas documentation")

- Use df.iloc[] to slice a dataframe based on an integer index location. Positions are zero-based, and the end of a slice is excluded.
    
- Examples:

`df.iloc[5:10, 2:]` → selects only rows 5 through 9, at columns 2 onward

`df.iloc[5:10]` → selects only rows 5 through 9, all columns 

`df.iloc[1, 2]` → selects value at row 1, column 2 

`df.iloc[[0, 2], [2, 4]]` → selects only rows 0 and 2, at columns 2 and 4
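
The patterns above can be tried on a small dataframe. This sketch adapts them to a three-row animal dataframe (an assumed reconstruction of the one used elsewhere in this guide), so the positions differ from the examples.

```python
import pandas as pd

# Reconstruction of the animal dataframe used in the other examples.
df = pd.DataFrame({
    "animal": ["cardinal", "gecko", "raven"],
    "class": ["Aves", "Reptilia", "Aves"],
    "color": ["red", "green", "black"],
    "legs": [2, 4, 2],
})

# Rows 0 through 1 (the end of the slice is excluded), columns 1 onward.
block = df.iloc[0:2, 1:]
print(block)

# A single value: row 1, column 2.
value = df.iloc[1, 2]
print(value)

# Lists of positions: rows 0 and 2, columns 0 and 3.
picked = df.iloc[[0, 2], [0, 3]]
print(picked)
```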
    

[**df.loc[]**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html "Link to loc pandas documentation")

- Use df.loc[] to slice a dataframe based on a label or Boolean array.
    
- Example:

In [None]:
print(df)

print()

df.loc[:, ['color', 'class']]

     animal     class  color  legs
0  cardinal      Aves    red     2
1     gecko  Reptilia  green     4
2     raven      Aves  black     2

   color     class
0    red      Aves
1  green  Reptilia
2  black      Aves
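
The example above selects by column labels only. The sketch below illustrates the other two uses mentioned, a Boolean array and label slices, on an assumed reconstruction of the same dataframe. Unlike `iloc`, label slices with `loc` include both endpoints.

```python
import pandas as pd

# Reconstruction of the animal dataframe shown in the output above.
df = pd.DataFrame({
    "animal": ["cardinal", "gecko", "raven"],
    "class": ["Aves", "Reptilia", "Aves"],
    "color": ["red", "green", "black"],
    "legs": [2, 4, 2],
})

# Boolean array for the rows, list of labels for the columns.
four_legs = df.loc[df["legs"] > 2, ["animal", "legs"]]
print(four_legs)

# Label slices include both endpoints: rows labeled 0 and 1,
# columns from 'animal' through 'color'.
block = df.loc[0:1, "animal":"color"]
print(block)
```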

## Key takeaways

The tools in this reference guide are foundational to structuring data, including filtering, sorting, merging, and slicing. You will find yourself using them throughout your career as a data professional. 

## Resources for more information

Refer to these links for more details on pandas functions and methods and their parameters.

- [pandas documentation](https://pandas.pydata.org/docs/index.html): the official reference, which describes every function, method, and parameter
    
- [W3Schools](https://www.w3schools.com/): explanations of Python functions in an easy-to-understand way