Skip to content

🐼 A practical pandas cheatsheet with examples. Includes grouped syntax by category (inspection, manipulation, aggregation, joins, datetime, nulls) and a sample DataFrame to test all operations β€” perfect for interviews, studying, or daily reference.

License

Notifications You must be signed in to change notification settings

MarkPhamm/pandas-tutorial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

19 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Pandas Syntax Cheatsheet

This guide covers essential pandas syntax across data inspection, basic manipulation, aggregation, joins/unions, datetime operations, and null handling.

Getting Started

To get started with this project, follow these steps:

  1. Clone the repository:

    git clone https://github.com/MarkPhamm/pandas-tutorial.git
    cd pandas-tutorial
  2. Create a virtual environment:

    python -m venv venv
  3. Activate the virtual environment:

    • On Windows:
      venv\Scripts\activate
    • On macOS and Linux:
      source venv/bin/activate
  4. Install the required packages:

    pip install -r requirements.txt

1. Data Inspection

1.1 View the first few rows

df.head()

1.2 View the last few rows

df.tail()

1.3 View column names

df.columns

1.4 Access a single column

df['col']
# or
df.col

1.5 Check the shape of the DataFrame

df.shape

1.6 Check data types and non-null counts

df.info()

1.7 Get summary statistics

df.describe()

1.8 Check for duplicates

df.duplicated().sum()

1.9 See value counts in a column

df['col'].value_counts()

1.10 .empty to check if a df is empty

if df.empty:
   df = pd.DataFrame({'SecondHighestSalary': [None]}) 

2. Basic Manipulation

2.1 Select specific columns

df[['col1', 'col2']]

2.2 Filter rows using condition

df[df['col'] > 100]

2.3 Filter rows using .between()

df[df['col'].between(10, 50)]

2.4 Rename columns

df.rename(columns={'old_name': 'new_name'})

2.5 Sort values

df.sort_values(by='col', ascending=False)

2.6 Select rows using .loc (label-based)

df.loc[5]

2.7 Select rows using .iloc (position-based)

df.iloc[5]

2.8 Change datatypes using astypes

df = df.astype({'col_name': 'desired_dtype'})
# or 
df['col_name'] = df['col_name'].astype('desired_dtype')

Return all rows where the 'verified' column is NOT True

df[~df['col'] == 'target']

3. Aggregation Functions

3.1 Group by and sum

df.groupby('group_col')['value_col'].sum().reset_index()

3.2 Group by and count

df.groupby('group_col')['value_col'].count().reset_index()

3.3 Group by and count unique values

df.groupby('group_col')['value_col'].nunique().reset_index()

3.4 Group by with multiple aggregations using .agg()

df.groupby('group_col').agg({
    'col1': 'sum',
    'col2': 'mean',
    'col3': 'nunique'
}).reset_index()

4. Join and Union

4.1 Merge two DataFrames on different keys

df1.merge(df2, left_on='key1', right_on='key2', how='inner')

4.2 Merge using how='left', how='right', or how='outer'

df1.merge(df2, on='key', how='left')

4.3 Union two DataFrames (like SQL UNION ALL)

pd.concat([df1, df2], ignore_index=True)

4.4 Union with deduplication (like SQL UNION)

pd.concat([df1, df2], ignore_index=True).drop_duplicates()

5. Datetime Functions

5.1 Create a timestamp

pd.Timestamp('2025-04-12')

5.2 Create a time delta

pd.Timedelta(days=30)

5.3 Convert a column to datetime

df['date_col'] = pd.to_datetime(df['date_col'])

5.4 Extract day from datetime

df['day'] = df['date_col'].dt.day

5.5 Convert datetime to monthly period

df['month'] = df['date_col'].dt.to_period('M')

5.6 Calculate time difference in seconds

(df['end_time'] - df['start_time']).dt.total_seconds()

6. Null Handling

6.1 Check for null values

df['col'].isnull()

6.2 Filter rows with null values

df[df['col'].isnull()]

7. Duplicates Handling

7.1 Drop duplicate rows

df.drop_duplicates()

7.2 Drop duplicates based on specific columns

df.drop_duplicates(subset=['col1', 'col2'])

7.3 Keep the last occurrence of duplicates

df.drop_duplicates(keep='last')

7.4 Keep no duplicates at all

df[df.duplicated() == False]

7.5 Get only duplicate rows

df[df.duplicated()]

7.6 Mark duplicates with a boolean

df.duplicated()

8. CSV File Handling

8.1 Read a CSV file

df = pd.read_csv('data.csv')

8.2 Read a CSV with index column

df = pd.read_csv('data.csv', index_col=0)

8.3 Write a DataFrame to CSV

df.to_csv('output.csv', index=False)

8.4 Read a CSV with specific columns

df = pd.read_csv('data.csv', usecols=['col1', 'col2'])

8.5 Read only first N rows from a CSV

df = pd.read_csv('data.csv', nrows=100)

8.6 Skip initial rows while reading

df = pd.read_csv('data.csv', skiprows=1)

8.7 Handle missing values while reading

df = pd.read_csv('data.csv', na_values=['NA', 'null', 'NaN'])

About

🐼 A practical pandas cheatsheet with examples. Includes grouped syntax by category (inspection, manipulation, aggregation, joins, datetime, nulls) and a sample DataFrame to test all operations β€” perfect for interviews, studying, or daily reference.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •