😀 Hello there!!
- LeetCode Stats
- Profile views
- LinkedIn
- Blog: @prasku
- School: Indiana University Bloomington
- Currently Playing (Spotify Profile)
- Python IDE
- SQL IDE

Python Knowledge


Core Python Skills

| Category | Skills |
|----------|--------|
| Syntax and Semantics | Python's syntax, data types, operators, and control flow (loops, conditionals) |
| Functions | Writing and using functions; lambda expressions; decorators; higher-order functions |
| Modules and Packages | Creating, importing, and managing Python modules and packages |
| Object-Oriented Programming (OOP) | Classes, inheritance, polymorphism, encapsulation; designing robust class hierarchies |
| Error Handling | Exception handling with try/except/finally; custom exceptions |
| Iterators and Generators | Understanding iterators and generators; use cases for memory-efficient data processing |
| Comprehensions | List, dictionary, and set comprehensions |
| Context Managers | Using with statements; creating custom context managers for resource management |
| Decorators and Metaprogramming | Using and creating decorators; understanding metaclasses for advanced programming |
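
A quick sketch tying a few of these together (a decorator, a generator, and a custom context manager); the names here are illustrative, not from any particular project:

```python
import time
from contextlib import contextmanager

def timed(func):
    """Decorator: report how long the wrapped function takes."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@contextmanager
def open_resource(name):
    """Custom context manager: acquire/release around the with-block."""
    print(f"acquiring {name}")
    try:
        yield name
    finally:
        print(f"releasing {name}")

def squares(n):
    """Generator: yields values lazily instead of building a list."""
    for i in range(n):
        yield i * i

@timed
def total(n):
    return sum(squares(n))  # memory-efficient: no intermediate list

with open_resource("db-connection"):
    print(total(1_000_000))
```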
PySpark Topics

| Category | Topics |
|----------|--------|
| Introduction to PySpark | Basics of Spark, Spark architecture, setting up the PySpark environment |
| RDD (Resilient Distributed Dataset) | RDD operations (transformations and actions), creating RDDs, RDD lineage, fault tolerance |
| DataFrames | Creating DataFrames, schema management, data manipulation, SQL operations on DataFrames |
| Spark SQL | SQL functions, running SQL queries, DataFrame API vs. SQL API |
| PySpark Functions | User-defined functions (UDFs), built-in functions (col, lit, when) |
| Data Processing | Aggregations, joins, filtering, sorting, groupBy operations, handling duplicates |
| Performance Optimization | Lazy evaluation, caching/persisting, partitioning, broadcast variables |
| Machine Learning | Using MLlib for machine learning tasks, creating pipelines, model training and evaluation |
| Data Handling | Handling missing values (mean/median imputation), dealing with skewed data, handling duplicates |
| Advanced Operations | Window functions, working with large datasets, data streaming |
| Error Handling & Debugging | Common errors in PySpark, debugging PySpark jobs, logging |
| Cluster Management | Different cluster managers (YARN, Mesos, Kubernetes, Standalone), resource allocation |
| Data Serialization | Efficient serialization methods for performance improvement |
| Graph Processing | Basics of GraphFrames and GraphX for graph processing |
| Integration with Other Tools | Working with Hadoop, Hive, and Kafka for data ingestion and processing |
| PySpark SQL | Working with PySpark SQL, SparkSession, SparkContext, DataFrames |
| Data Formats | Reading and writing data in formats like CSV, Parquet, JSON |
| Algorithms | Common algorithms for clustering, classification, regression, and recommendation using MLlib |
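
A minimal starter showing several of these topics in one place (SparkSession, DataFrame creation, built-in functions, and an aggregation); the column values are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, avg

spark = SparkSession.builder.appName("starter").getOrCreate()

# Create a small DataFrame in place of a real data source.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 17), ("carol", 52)],
    ["name", "age"],
)

# Built-in functions + a derived column.
labeled = df.withColumn("group", when(col("age") >= 18, "adult").otherwise("minor"))

# Aggregation with groupBy; Spark evaluates lazily until show()/collect().
labeled.groupBy("group").agg(avg("age").alias("avg_age")).show()
```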
PySpark Methods

| Category | Methods |
|----------|---------|
| RDD Operations | map(), flatMap(), filter(), mapPartitions(), reduce(), reduceByKey(), groupByKey(), sortByKey(), join(), cogroup(), distinct(), union(), intersection(), subtract(), zipWithIndex(), cache(), persist(), countByValue(), take(), takeSample(), sample(), collect(), first(), foreach(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile() |
| DataFrame Creation | createDataFrame(), spark.read.csv(), spark.read.json(), spark.read.parquet(), spark.read.orc(), spark.read.format("avro").load() |
| DataFrame Operations | select(), filter(), where(), drop(), withColumn(), withColumnRenamed(), alias(), distinct(), orderBy(), sort(), groupBy(), agg(), count(), sum(), avg(), max(), min(), collect(), head(), show(), describe(), printSchema(), dropDuplicates(), limit(), union(), intersect(), exceptAll(), join(), crossJoin(), coalesce(), repartition(), cache(), persist(), toPandas() |
| SQL Functions | col(), lit(), when(), expr(), regexp_replace(), concat(), substr(), length(), lower(), upper(), trim(), ltrim(), rtrim(), split(), date_format(), to_date(), year(), month(), dayofmonth(), hour(), minute(), second(), datediff(), current_date(), current_timestamp(), lag(), lead(), rank(), dense_rank(), row_number() |
| Aggregation Functions | sum(), avg(), min(), max(), count(), agg(), countDistinct(), approxQuantile(), pivot(), cube(), rollup() |
| Window Functions | window(), Window.partitionBy(), Window.orderBy(), rowsBetween(), rangeBetween(), rank(), dense_rank(), row_number(), lag(), lead() |
| DataFrame I/O | read, write, save(), load(), csv(), json(), parquet(), orc(), format("avro"), jdbc(), saveAsTable()/enableHiveSupport() |
| Performance Optimization | cache(), persist(), unpersist(), repartition(), coalesce(), broadcast(), checkpoint(), explain(), explain(mode=...) |
| Machine Learning (MLlib) | fit(), transform(), predict(), evaluate(), CrossValidator(), ParamGridBuilder(), Pipeline(), VectorAssembler(), StringIndexer(), OneHotEncoder(), RandomForestClassifier(), LogisticRegression(), DecisionTreeClassifier(), KMeans(), ALS(), PCA(), LDA(), NaiveBayes(), GBTClassifier()/GBTRegressor(), LinearRegression(), BinaryClassificationEvaluator()/MulticlassClassificationEvaluator(), RegressionEvaluator() |
| Data Handling | fillna(), dropna(), replace(), dropDuplicates(), na.fill(), na.drop(), na.replace() |
| Graph Processing | GraphFrame(), pageRank(), shortestPaths(), connectedComponents(), triangleCount(), labelPropagation() |
| Error Handling & Debugging | try/except, logging via log4j (logger.info/warn/error), sc.setLogLevel() |
| Cluster Management | --master {standalone, yarn, mesos, k8s}, spark-submit, spark-submit --kill, spark-submit --status, deploy modes, application logs |
| Data Serialization | write.format(), saveAsTable(), load(), write.parquet(), write.csv(), write.json() |
| Integration with Other Tools | Hadoop, Hive, Kafka, jdbc(), spark-submit, spark-shell, HDFS, Sqoop |
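
As a worked sketch of the window-function entries above, assuming a small sales-like DataFrame (the column and key names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, lag, sum as spark_sum

spark = SparkSession.builder.appName("windows").getOrCreate()

df = spark.createDataFrame(
    [("a", "2024-01-01", 10), ("a", "2024-01-02", 15), ("b", "2024-01-01", 7)],
    ["key", "day", "value"],
)

w = Window.partitionBy("key").orderBy("day")

result = (
    df.withColumn("rn", row_number().over(w))             # rank rows per key
      .withColumn("prev", lag("value").over(w))           # previous row's value
      .withColumn("running", spark_sum("value").over(w))  # running total per key
)
result.show()
```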
Spark Performance Tuning

| Category | Technique | Code Example |
|----------|-----------|--------------|
| Resource Management | Tune executor memory for optimal performance. | `spark.conf.set("spark.executor.memory", "4g")` |
| Resource Management | Tune the number of cores per executor. | `spark.conf.set("spark.executor.cores", "4")` |
| Resource Management | Tune driver memory for better resource management. | `spark.conf.set("spark.driver.memory", "4g")` |
| Resource Management | Tune task parallelism. | `spark.conf.set("spark.default.parallelism", "100")` |
| Resource Management | Configure resource allocation based on workload requirements. | Depends on the cluster manager and workload. |
| Data Management | Use efficient data formats like Parquet or ORC. | `df.write.format("parquet").save("data.parquet")` |
| Data Management | Use efficient compression such as Snappy or Gzip. | `df.write.option("compression", "snappy").parquet("data.parquet")` |
| Data Management | Read files efficiently. | `spark.read.format("parquet").load("data.parquet")` |
| Data Management | Choose efficient data sources for faster access. | `df = spark.read.format("jdbc").options(url="jdbc:mysql://localhost/test", dbtable="table").load()` |
| Data Management | Use efficient serialization (e.g., Kryo). | `spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")` |
| Data Management | Partition data based on query requirements. | `df_partitioned = df.repartition("date")` |
| Data Management | Prefer ORC or Parquet for writes. | `df.write.format("orc").save("data.orc")` |
| Data Management | Improve data loading and storage. | `df_loaded = spark.read.option("header", "true").csv("data.csv")`<br>`df_loaded.write.option("compression", "gzip").csv("data_out.csv")` |
| Data Management | Remove duplicates to cut storage and compute costs. | `df_deduplicated = df.dropDuplicates()` |
| Shuffle and Task Management | Minimize shuffles by filtering early. | `df_filtered = df.filter(df["value"] > 10)` |
| Shuffle and Task Management | Tune the number of shuffle partitions. | `spark.conf.set("spark.sql.shuffle.partitions", "100")` |
| Shuffle and Task Management | Analyze and optimize DAG execution plans. | Inspect the DAG in the Spark UI. |
| Shuffle and Task Management | Break long-running tasks into smaller ones. | Depends on the specific job. |
| Shuffle and Task Management | Tune task scheduling to reduce resource contention. | Depends on the cluster configuration. |
| Join Optimization | Use efficient joins to reduce processing time. | `df_joined = df1.join(df2, df1.key == df2.key, "inner")` |
| Join Optimization | Tune the broadcast-join size threshold. | `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")` |
| Join Optimization | Optimize joins combined with aggregations. | `df_result = df1.join(df2, df1.key == df2.key).groupBy("key").agg({"value": "sum"})` |
| Join Optimization | Mitigate skew in joins (e.g., salting). | `df_skewed = df1.withColumn("salt", df1["key"] % 10).join(df2.withColumn("salt", df2["key"] % 10), ["key", "salt"])` |
| Performance Analysis | Monitor resource usage with the Spark UI. | Track executors, stages, and storage in the Spark UI. |
| Performance Analysis | Analyze Spark SQL execution plans. | `df.explain(True)` |
| Performance Analysis | Filter early to reduce data size. | `df_filtered = df.filter(df["value"] > 100)` |
| Caching and Persistence | Cache DataFrames that are reused across computations. | `df.cache()` |
| Caching and Persistence | Release cached data to avoid wasting memory. | `df.unpersist()` |
| Data Handling and Transformation | Use optimized aggregation strategies for large datasets. | `result_df = df.groupBy("key").agg({"value": "max"})` |
| Data Handling and Transformation | Sort large datasets efficiently. | `df_sorted = df.orderBy("key")` |
| Data Handling and Transformation | Keep transformations efficient. | `df_transformed = df.withColumn("new_col", df["old_col"] * 2)` |
| Data Handling and Transformation | Reduce serialization time. | `spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")` |
| Data Handling and Transformation | Avoid redundant computations. | `result_df = df.select("col1").distinct()` |
| Handling Skew and Bottlenecks | Mitigate data skew during processing (e.g., salting). | `df_skewed = df.withColumn("salt", df["key"] % 10)` |
| Handling Skew and Bottlenecks | Minimize data movement across the network. | Depends on network and cluster settings. |
| Configuration Tuning | Adjust Spark configuration parameters. | `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")` |
| Configuration Tuning | Cap rolling executor log size. | `spark.conf.set("spark.executor.logs.rolling.maxSize", "10MB")` |
| File Management | Read large files efficiently. | `df_large = spark.read.option("header", "true").csv("large_data.csv")` |
| File Management | Load data efficiently to reduce processing time. | `df_loaded = spark.read.json("data.json")` |
| Table Size Handling | Optimize operations for tables in GBs, TBs, and PBs. | Combine partitioning with join optimization. |
| Table Size Handling | Partition large tables for processing efficiency. | `df_partitioned = df.repartition("partition_col")` |
| Table Size Handling | Write large tables efficiently. | `df.write.mode("overwrite").parquet("large_table.parquet")` |
| Join Types and Performance | Broadcast joins for small DataFrames. | `from pyspark.sql.functions import broadcast`<br>`df_broadcasted = df1.join(broadcast(df2), "key")` |
| Join Types and Performance | Shuffle joins for DataFrames too large to broadcast. | `df_shuffled = df1.join(df2, "key")` |
| Join Types and Performance | Bucketed joins: bucket tables on the join column. | `spark.conf.set("spark.sql.sources.bucketing.enabled", "true")` |
| Join Types and Performance | Prefer sort-merge joins for large inputs. | `spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")` |
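
The salting one-liners in the table are cramped, so here is a fuller sketch of skew mitigation with a random salt, assuming df_large is skewed on key and df_small is the dimension side (all names and data are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, floor, explode, array, lit

spark = SparkSession.builder.appName("salting").getOrCreate()

# Synthetic skew: almost every row shares the same join key.
df_large = spark.createDataFrame(
    [("hot", i) for i in range(1000)] + [("cold", 1)], ["key", "value"]
)
df_small = spark.createDataFrame([("hot", "H"), ("cold", "C")], ["key", "label"])

N = 10  # number of salt buckets

# Skewed side: attach a random salt in [0, N).
salted_large = df_large.withColumn("salt", floor(rand() * N).cast("int"))

# Small side: replicate each row once per salt value.
salted_small = df_small.withColumn(
    "salt", explode(array(*[lit(i) for i in range(N)]))
)

# Join on (key, salt) so the hot key is spread over N partitions.
joined = salted_large.join(salted_small, ["key", "salt"]).drop("salt")
joined.groupBy("key").count().show()
```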
Data Structures and Algorithms

| Category | Skills |
|----------|--------|
| Built-in Data Structures | Lists, tuples, sets, dictionaries, arrays |
| Algorithms | Sorting: merge sort, quick sort, heap sort; Searching: binary search, linear search |
| Advanced Data Structures | Linked lists (singly, doubly); stacks; queues (FIFO, priority queues); trees (binary trees, AVL trees, B-trees); heaps (min-heap, max-heap); graphs (adjacency matrix, adjacency list) |
| Dynamic Programming | Solving complex problems by breaking them into simpler subproblems |
| Hashing | Hash tables; handling collisions (chaining, open addressing) |
| Tries | Efficiently handling prefixes and dynamic sets of strings |
| Disjoint Set (Union-Find) | Managing partitions and connected components in sets |
| Bit Manipulation | Efficient data manipulation using bitwise operations |
| Amortized Analysis | Analyzing algorithm performance over time, accounting for occasional expensive operations |
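
As one concrete instance from this list, a small union-find with path compression and union by rank (a standard sketch, not tied to any particular problem):

```python
class DisjointSet:
    """Union-Find with path compression and union by rank."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same component
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shorter tree under the taller one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

ds = DisjointSet(5)
ds.union(0, 1); ds.union(3, 4)
print(ds.find(1) == ds.find(0))  # True: same component
print(ds.find(2) == ds.find(3))  # False: separate components
```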
Data Handling and Processing

| Category | Skills |
|----------|--------|
| File Handling | CSV, JSON, XML, Parquet, Avro, and other file formats used in data engineering |
| Data Parsing | Parsing and extracting data from formats including JSON, XML, and custom formats |
| Data Manipulation | Using libraries like Pandas for data cleaning, transformation, aggregation, and feature engineering |
| Data Integration | Combining data from multiple sources; handling schema evolution; dealing with inconsistencies |
| Data Transformation | Transforming data with ETL processes; tools like Apache NiFi, Talend, or custom Python scripts |
| Data Validation | Implementing validation checks to ensure data quality and consistency |
| Streaming Data Processing | Handling real-time data streams with frameworks like Apache Kafka, Apache Flink, or Spark Streaming |
| Data Serialization | Serialization formats (JSON, Protobuf, Avro) and their impact on performance and interoperability |
| Data Normalization | Normalizing and denormalizing data for database design and performance optimization |
| Batch vs. Real-Time Processing | Differentiating batch from real-time processing and knowing when to use each |
| Data Lineage | Tracking the origin and transformations of data; ensuring traceability and auditability |
Design Patterns and Principles

| Category | Skills |
|----------|--------|
| Creational Patterns | Singleton, Factory Method, Abstract Factory, Builder, Prototype |
| Structural Patterns | Adapter, Decorator, Proxy, Composite, Flyweight, Bridge |
| Behavioral Patterns | Observer, Strategy, Command, Chain of Responsibility, Mediator, Memento, State, Template Method, Visitor |
| Concurrency Patterns | Producer-Consumer, Read-Write Lock, Thread Pool, Future/Promise |
| Architectural Patterns | Microservices, Serverless Architecture, Event-Driven Architecture, CQRS (Command Query Responsibility Segregation) |
| Design Principles | SOLID (Single Responsibility, Open/Closed, Liskov Substitution, Interface Segregation, Dependency Inversion); DRY (Don't Repeat Yourself); KISS (Keep It Simple, Stupid); YAGNI (You Aren't Gonna Need It) |
| Other Patterns | Facade, Iterator, Factory |
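
To make one of these concrete, a minimal Observer in Python; the class and event names are illustrative, not from any framework:

```python
class EventSource:
    """Subject: notifies registered observers when something happens."""
    def __init__(self):
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def emit(self, event):
        for callback in self._observers:
            callback(event)

def audit_log(event):
    print(f"[audit] {event}")

def alert(event):
    if event.get("level") == "error":
        print(f"[alert] {event['message']}")

source = EventSource()
source.subscribe(audit_log)
source.subscribe(alert)
source.emit({"level": "info", "message": "job started"})
source.emit({"level": "error", "message": "job failed"})
```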
APIs and Web Services

| Category | Skills |
|----------|--------|
| RESTful APIs | Creation and consumption; authentication (OAuth, JWT); rate limiting; pagination |
| SOAP APIs | Understanding WSDL; handling XML-based communication; SOAP actions |
| Web Scraping | BeautifulSoup; Scrapy; handling JavaScript-rendered content (Selenium) |
| GraphQL | Querying APIs with GraphQL; schema definition; resolvers; query optimization |
| API Testing | Tools like Postman, Swagger, or Insomnia for API testing and documentation |
| API Documentation | Creating and managing API documentation with tools like Swagger/OpenAPI |
| API Security | Best practices for securing APIs: encryption, authentication, and authorization |
| Microservices Communication | Message brokers (e.g., RabbitMQ, Kafka) for inter-service communication |
| WebSockets | Real-time communication with WebSockets for use cases like live data feeds |
| Service Orchestration | Orchestrating microservices and managing inter-service communication |
| API Gateways | API gateways (e.g., Kong, AWS API Gateway) for managing, securing, and monitoring API traffic |
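
A small sketch of consuming a paginated REST API with a bearer token using the requests library; the URL, token, and page parameters are placeholders, and real APIs vary in their pagination scheme:

```python
import requests

BASE_URL = "https://api.example.com/items"     # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # e.g., a JWT

def fetch_all(page_size=100):
    """Walk a page-numbered API until an empty page comes back."""
    page, items = 1, []
    while True:
        resp = requests.get(
            BASE_URL,
            headers=HEADERS,
            params={"page": page, "per_page": page_size},
            timeout=10,
        )
        resp.raise_for_status()  # surface 4xx/5xx instead of silently continuing
        batch = resp.json()
        if not batch:
            return items
        items.extend(batch)
        page += 1

# items = fetch_all()  # uncomment with a real endpoint and token
```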
Testing and Debugging

| Category | Skills |
|----------|--------|
| Unit Testing | unittest, pytest for writing and executing test cases; test fixtures; mocking; parameterized tests |
| Integration Testing | Testing interactions between components or systems; pytest fixtures or unittest integration tests |
| Functional Testing | Ensuring the system functions as expected from an end-user perspective; pytest and Selenium for web applications |
| End-to-End Testing | Testing the complete flow of an application; frameworks like Selenium, Cypress |
| Performance Testing | Load testing and stress testing; tools like Locust, JMeter |
| Security Testing | Identifying vulnerabilities; tools like OWASP ZAP, Bandit for Python security analysis |
| Code Coverage | Measuring the percentage of code covered by tests; coverage.py or integrated coverage tools in CI/CD pipelines |
| Debugging Techniques | Debugging tools and techniques (e.g., pdb, ipdb, IDE debuggers in PyCharm or VS Code) |
| Logging | Logging strategies with Python's logging module or third-party libraries to track behavior and diagnose issues |
| Error Reporting | Capturing and reporting errors in production (e.g., Sentry, Loggly) |
| Static Code Analysis | Analyzing code for errors, style issues, and code smells; pylint, flake8, black for formatting |
| Continuous Integration (CI) | CI pipelines for automated testing; GitHub Actions, Jenkins, Travis CI |
| Continuous Deployment (CD) | CD pipelines for automated, smooth, and consistent deployment |
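
A compact pytest example covering a fixture and parameterized tests (save as test_slug.py and run with `pytest`); the slugify function is just a stand-in for code under test:

```python
import pytest

def slugify(text: str) -> str:
    """Stand-in function under test: lowercase, dash-separated."""
    return "-".join(text.lower().split())

@pytest.fixture
def sample_titles():
    # Fixture: shared test data, set up fresh for each test that asks for it.
    return ["Hello World", "  Spaced   Out  "]

def test_slugify_with_fixture(sample_titles):
    assert slugify(sample_titles[0]) == "hello-world"
    assert slugify(sample_titles[1]) == "spaced-out"

@pytest.mark.parametrize(
    "raw,expected",
    [("One", "one"), ("Two Words", "two-words"), ("", "")],
)
def test_slugify_parametrized(raw, expected):
    assert slugify(raw) == expected
```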
Security and Compliance

| Category | Skills |
|----------|--------|
| Data Privacy | Data privacy laws (e.g., GDPR, CCPA); data anonymization and pseudonymization; ensuring legal compliance |
| Data Encryption | Encryption at rest and in transit; encryption algorithms (e.g., AES, RSA); key management with tools like AWS KMS, Azure Key Vault |
| Access Control | Role-based access control (RBAC); attribute-based access control (ABAC); managing permissions and user roles in data systems |
| Audit Logging | Tracking and recording access to data and system changes; tools and frameworks for log management |
| Security Best Practices | Securing data systems; secure coding practices; regular vulnerability assessments; security patches |
| Network Security | Network security measures: firewalls, VPNs, network segmentation |
| Incident Response | Incident response plans; detecting, responding to, and recovering from security incidents and breaches |
| Compliance Frameworks | Industry standards and frameworks (e.g., HIPAA, SOC 2, ISO 27001); ensuring systems and processes meet them |
| Data Governance | Data governance practices; data stewardship; data quality; metadata management |
| Risk Management | Assessing and managing data security and compliance risks; risk assessments; mitigation strategies |
Advanced Programming Concepts

| Category | Skills |
|----------|--------|
| Asynchronous Programming | Asynchronous I/O with asyncio; await for asynchronous tasks; managing event loops; concurrency patterns in Python |
| Concurrency | threading for I/O-bound tasks; multiprocessing for CPU-bound tasks; the Global Interpreter Lock (GIL); concurrent patterns |
| Metaprogramming | Using and creating decorators; metaclasses for dynamic class creation; customizing class behavior at runtime |
| Coroutines | Writing and managing coroutines for efficient asynchronous code; how coroutines differ from regular functions |
| Context Managers | Creating and using context managers for resource management; with statements; custom context managers |
| Dynamic Typing and Reflection | Leveraging Python's dynamic typing and reflection to write flexible, generic code |
| Memory Management | Python's memory management; garbage collection; memory optimization techniques |
| Design by Contract | Preconditions, postconditions, and invariants; ensuring code reliability and correctness |
| Functional Programming | Higher-order functions; immutability; pure functions for cleaner code |
| Meta-Programming Techniques | Custom descriptors; dynamic attribute access; class factories for advanced scenarios |
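
A minimal asyncio sketch showing coroutines running concurrently on one event loop; the delays and names are arbitrary:

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    """Coroutine: simulates I/O by sleeping without blocking the loop."""
    await asyncio.sleep(delay)
    return f"{name} done after {delay}s"

async def main():
    # gather() runs the coroutines concurrently; total time is about
    # max(delays), not their sum, because the loop interleaves the waits.
    results = await asyncio.gather(
        fetch("a", 1.0),
        fetch("b", 0.5),
        fetch("c", 0.2),
    )
    for line in results:
        print(line)

asyncio.run(main())
```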
Data Preprocessing and Cleaning Operations

| Category | Function/Method | Description | Source |
|----------|-----------------|-------------|--------|
| Handling Missing Values | dropna, fillna, isnull, notnull | Drop or fill missing values, check for nulls. | freecodecamp.org, datagy.io |
| Removing Duplicates | drop_duplicates, duplicated | Identify and remove duplicate rows. | freecodecamp.org, datagy.io |
| Data Type Conversion | astype | Convert data types. | freecodecamp.org |
| String Operations | str | Vectorized string operations for cleaning text data. | realpython.com |
| Sorting Data | sort_values, sort_index | Sort data by values or index. | pickl.ai |
| Filtering Data | query, eval | Query and filter data efficiently. | pandas.pydata.org |
| Column Manipulation | rename, drop, set_index, reset_index | Rename and drop columns, set and reset the index. | realpython.com |
| Aggregation and Grouping | groupby, agg | Group data and perform aggregate functions. | pandas.pydata.org |
| Pivoting Data | pivot_table, melt, stack, unstack | Reshape data for analysis. | stackoverflow.com, jakevdp.github.io |
| Merging and Joining | merge, join, concat | Combine data from multiple DataFrames. | realpython.com |
| Visualization | plot, hist, boxplot | Visualize data directly from DataFrames. | w3schools.com, DataCamp |
| Memory Optimization | memory_usage, astype | Optimize memory usage by changing data types. | pandas.pydata.org |
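
A short pandas sketch stringing several of these operations together on a made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": [" Alice ", "bob", "bob", None],
    "age": ["34", "17", "17", "52"],
    "city": ["NYC", "LA", "LA", np.nan],
})

clean = (
    df.drop_duplicates()           # remove exact duplicate rows
      .dropna(subset=["name"])     # drop rows missing a name
      .assign(
          name=lambda d: d["name"].str.strip().str.title(),  # vectorized str ops
          age=lambda d: d["age"].astype(int),                # type conversion
      )
      .fillna({"city": "unknown"}) # fill remaining nulls
      .sort_values("age")
)
print(clean)
```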
Pandas Concepts

| Category | Concepts |
|----------|----------|
| Basic Operations | Importing Pandas; creating DataFrames and Series; basic data structures; DataFrame arithmetic (add, subtract, multiply, divide) |
| Data Import/Export | Reading data from CSV, Excel, JSON, SQL databases; writing data to various formats (CSV, Excel, JSON) |
| Data Inspection | Viewing data: head(), tail(), info(), describe(), shape, dtypes, columns |
| Indexing and Selection | Selecting rows and columns; loc and iloc; boolean indexing; setting and resetting the index |
| Data Cleaning | Handling missing data (dropna(), fillna()); detecting and removing duplicates; renaming columns |
| Data Transformation | Applying functions (apply(), map(), applymap()); vectorized operations; type conversions; sorting (sort_values(), sort_index()) |
| Grouping and Aggregation | Grouping with groupby(); aggregations (sum(), mean(), count(), min(), max(), agg()) |
| Merging and Joining | Concatenating DataFrames (concat()); merging and joining (merge(), join()) |
| Time Series | Working with date and time data; generating date ranges; resampling; time-based indexing |
| Pivoting and Reshaping | Pivot tables (pivot_table()); reshaping (melt(), stack(), unstack()) |
| Data Visualization | Plotting with Pandas (plot(), hist(), boxplot()); integration with Matplotlib and Seaborn |
| Advanced Indexing | MultiIndex (hierarchical indexing); cross-sections (xs()); advanced indexing with pd.IndexSlice |
| Performance Optimization | Efficient memory usage; eval() and query() for efficient computations; optimizing data types |
| Iterating | Iterating over rows (iterrows(), itertuples()) and columns; applying functions across DataFrames |
| Window Functions | Rolling and expanding windows; window functions (rolling(), expanding(), ewm()) |
| Data Serialization | Reading and writing formats like Parquet, HDF5, and Feather |
| Integration with Other Libraries | NumPy for numerical operations; Pandas with Scikit-Learn for machine-learning workflows |
NumPy Concepts

| Category | Concept | Description | Source |
|----------|---------|-------------|--------|
| Array Creation | np.array(), np.zeros(), np.ones() | Creating arrays, arrays of zeros, and arrays of ones | The Data Monk, NumPy |
| Array Initialization | np.empty(), np.full(), np.arange(), np.linspace() | Uninitialized arrays, arrays filled with a value, sequences of numbers, evenly spaced numbers | Learn R, Python & Data Science Online |
| Basic Operations | Addition, subtraction, multiplication, division | Element-wise operations on arrays | datagy, ProjectPro |
| Indexing and Slicing | Indexing, slicing, boolean indexing | Accessing elements, slicing arrays, filtering with boolean conditions | datagy, Interview Query |
| Reshaping and Resizing | np.reshape(), np.resize(), np.flatten() | Changing the shape and size of arrays, flattening arrays | Interview Query, NumPy |
| Aggregation Functions | np.sum(), np.mean(), np.median(), np.std() | Sum, mean, median, and standard deviation of array elements | The Data Monk, ProjectPro |
| Broadcasting | Broadcasting rules | Applying operations to arrays of different shapes | NumPy, Real Python |
| Linear Algebra | np.dot(), np.matmul(), np.linalg.inv(), np.linalg.eig() | Dot products, matrix multiplication, matrix inversion, eigenvalues and eigenvectors | ProjectPro, NumPy |
| Advanced Indexing | Fancy indexing, index arrays | Using arrays of indices to access elements | Interview Query, NumPy |
| Data Types | dtype | Specifying and understanding data types within arrays | ProjectPro, NumPy |
| File I/O | np.save(), np.load(), np.savetxt(), np.loadtxt() | Saving and loading arrays to/from files | NumPy |
| Random Sampling | np.random.rand(), np.random.randint(), np.random.choice() | Generating random numbers and sampling from arrays | The Data Monk; Learn R, Python & Data Science Online |
| Statistical Functions | np.percentile(), np.histogram(), np.bincount() | Percentiles, histograms, and counts of unique values | Real Python, ProjectPro |
| Boolean Operations | Logical operations | Logical operations on arrays, such as np.any(), np.all() | datagy, ProjectPro |
| Sorting and Searching | np.sort(), np.argsort(), np.searchsorted() | Sorting arrays and searching for elements | The Data Monk, ProjectPro |
| Set Operations | np.unique(), np.intersect1d(), np.union1d() | Set operations on arrays | Learn R, Python & Data Science Online; NumPy |
| Data Cleaning | Handling missing values, outlier detection | Removing null values, identifying and handling outliers | The Data Monk |
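
A tiny NumPy sketch of broadcasting and boolean indexing, two of the concepts above, on synthetic data:

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Broadcasting: the 1-D row of column means stretches across both rows.
centered = data - data.mean(axis=0)
print(centered)  # [[-1.5 -1.5 -1.5] [ 1.5  1.5  1.5]]

# Boolean indexing: keep only values above the global mean (3.5).
above = data[data > data.mean()]
print(above)  # [4. 5. 6.]
```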
Timeseries Analysis

| Category | Function/Method | Description | Source |
|----------|-----------------|-------------|--------|
| Resampling | resample | Resample time series data to different frequencies. | pandas.pydata.org |
| Rolling Windows | rolling | Apply rolling window calculations. | pandas.pydata.org |
| Expanding Windows | expanding | Apply expanding window calculations. | pandas.pydata.org |
| EWM | ewm | Apply exponentially weighted functions. | pandas.pydata.org |
| Datetime Indexing | DatetimeIndex | Indexing with date ranges. | jakevdp.github.io |
| Time Period Indexing | PeriodIndex | Indexing with fixed-frequency intervals. | jakevdp.github.io |
Timestamps and Date Handling

| Category | Function/Method | Description | Source |
|----------|-----------------|-------------|--------|
| Creating Timestamps | pd.Timestamp | Create a timestamp object. | pandas.pydata.org |
| Date Conversion | pd.to_datetime | Convert strings, epochs, or a mixture to a DatetimeIndex. | pandas.pydata.org |
| Period Handling | pd.Period | Represent a time span. | pandas.pydata.org |
| Timedelta Handling | pd.Timedelta | Represent a time duration. | jakevdp.github.io |
| Date Range Generation | pd.date_range | Generate a fixed-frequency date range. | jakevdp.github.io |
| Date Offsets | pd.offsets | Create date offsets for business days, months, years, etc. | pandas.pydata.org |
| Datetime Properties | .dt | Accessor for date properties like year, month, day, weekday, etc. | pandas.pydata.org |
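
Putting a few of these together: a resample plus a rolling mean over a generated date range, with the .dt accessor at the end (synthetic data throughout):

```python
import pandas as pd
import numpy as np

# Hourly series over four days, indexed by a DatetimeIndex.
idx = pd.date_range("2024-07-01", periods=96, freq="h")
ts = pd.Series(np.random.default_rng(0).normal(size=96), index=idx)

daily = ts.resample("D").mean()          # downsample hourly -> daily means
smooth = daily.rolling(window=2).mean()  # rolling window over the daily series
print(daily, smooth, sep="\n")

# The .dt accessor works on datetime columns of a DataFrame.
df = ts.to_frame("value").reset_index().rename(columns={"index": "when"})
df["weekday"] = df["when"].dt.day_name()
print(df.head())
```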

SQL Knowledge


Database Skills

| Category | Skills |
|----------|--------|
| Basic SQL Operations | SELECT, INSERT, UPDATE, DELETE; JOINs, WHERE, GROUP BY, ORDER BY |
| Joins | INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, CROSS JOIN |
| Subqueries | Nested queries; correlated subqueries |
| Indexes | Creating, using, and optimizing indexes for query performance |
| Views | Creating, managing, and using database views for abstraction and security |
| Transactions | ACID properties; COMMIT, ROLLBACK, SAVEPOINT |
| Stored Procedures | Writing and managing stored procedures to encapsulate complex logic |
| Functions | Creating and using functions for reusable code |
| Triggers | Creating triggers for automatic actions in response to database changes |
| Constraints | Primary keys, foreign keys, unique constraints, and check constraints |
| Backup and Recovery | Backup and recovery strategies; point-in-time recovery |
| Performance Tuning | Query optimization; index tuning; execution plans; profiling and analyzing performance |
| Security Management | User roles; permissions; encryption; securing database access |
| Partitioning | Horizontal and vertical partitioning for managing large datasets |
| Replication | Replication techniques for redundancy and high availability |
| Sharding | Sharding strategies to distribute data across multiple servers |
| Data Warehousing | Designing and implementing data warehouses for large-scale storage and analysis |
| ETL (Extract, Transform, Load) | Extracting data from sources, transforming it, and loading it into a data warehouse |
| Data Lakes | Creating and managing data lakes for raw and unstructured data |
| Recursive Queries | Recursive queries for hierarchical data |
| Window Functions | Window functions for advanced analytics and calculations |
| Pivot and Unpivot | Transforming data for better analysis and reporting |
| Data Types | Numeric, string, date/time, and custom types |
| String Manipulation | Functions and techniques for manipulating and querying string data |
SQL Practice Websites

| Site Name | Link |
|-----------|------|
| DataLemur | datalemur.com |
| SQLBolt | sqlbolt.com |
| StrataScratch | stratascratch.com |
| w3resource | w3resource.com |
SQL Functions and Keywords

| Category | Functions/Keywords |
|----------|--------------------|
| Basic SQL Operations | SELECT, INSERT, UPDATE, DELETE, JOIN, WHERE, GROUP BY, ORDER BY |
| Joins | INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, CROSS JOIN |
| Subqueries | Nested queries, correlated subqueries |
| Indexes | CREATE INDEX, DROP INDEX, ALTER INDEX, INDEX |
| Views | CREATE VIEW, DROP VIEW, ALTER VIEW, VIEW |
| Transactions | BEGIN TRANSACTION, COMMIT, ROLLBACK, SAVEPOINT, SET TRANSACTION |
| Stored Procedures | CREATE PROCEDURE, ALTER PROCEDURE, DROP PROCEDURE, EXECUTE, CALL |
| Functions | CREATE FUNCTION, ALTER FUNCTION, DROP FUNCTION, RETURN, CALL |
| Triggers | CREATE TRIGGER, ALTER TRIGGER, DROP TRIGGER, BEFORE, AFTER, INSTEAD OF |
| Constraints | PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, NOT NULL, DEFAULT |
| Backup and Recovery | BACKUP DATABASE, RESTORE DATABASE, BACKUP LOG, RESTORE LOG, WITH NORECOVERY |
| Performance Tuning | EXPLAIN, ANALYZE, OPTIMIZE, INDEX, PROFILE, STATISTICS |
| Security Management | GRANT, REVOKE, DENY, USER, ROLE, PERMISSIONS |
| Partitioning | PARTITION BY, SUBPARTITION BY, ALTER PARTITION, DROP PARTITION |
| Replication | MASTER, SLAVE, REPLICA, REPLICATION, SYNC, ASYNC |
| Sharding | SHARD, SHARDING KEY, ROUTING, DISTRIBUTED |
| Data Warehousing | OLAP, OLTP, CUBE, ROLLUP, DRILLDOWN, DIMENSION, FACT |
| ETL (Extract, Transform, Load) | ETL, EXTRACT, TRANSFORM, LOAD, DATA INTEGRATION |
| Data Lakes | DATA LAKE, RAW DATA, UNSTRUCTURED DATA, SCHEMA ON READ |
| Recursive Queries | WITH RECURSIVE, UNION ALL, CTE, RECURSIVE |
| Window Functions | ROW_NUMBER, RANK, DENSE_RANK, NTILE, PARTITION BY, ORDER BY, OVER |
| Pivot and Unpivot | PIVOT, UNPIVOT, CROSS TAB, TRANSFORM |
| Data Types | INT, VARCHAR, CHAR, TEXT, DATE, TIME, FLOAT, BOOLEAN, ARRAY, JSON |
| String Manipulation | CONCAT, SUBSTRING, CHAR_LENGTH, TRIM, REPLACE, SPLIT, UPPER, LOWER |
Timestamp Functions and Format Conversion

| Database | Functions/Keywords |
|----------|--------------------|
| Oracle | CURRENT_TIMESTAMP, SYSTIMESTAMP, LOCALTIMESTAMP, SYSDATE, CURRENT_DATE, EXTRACT, TO_TIMESTAMP, TO_DATE, CAST, DBTIMEZONE, SESSIONTIMEZONE, ADD_MONTHS, MONTHS_BETWEEN, NEXT_DAY, LAST_DAY, ROUND, TRUNC |
| Oracle (format conversion) | `TO_CHAR(date_column, 'YYYY-MM-DD HH24:MI:SS')`; `TO_TIMESTAMP('2024-07-25 12:34:56', 'YYYY-MM-DD HH24:MI:SS')`; `TO_DATE('2024-07-25', 'YYYY-MM-DD')`; `FROM_TZ(TIMESTAMP '2024-07-25 12:34:56', 'UTC')` |
| PostgreSQL | CURRENT_TIMESTAMP, NOW(), LOCALTIMESTAMP, CURRENT_DATE, CURRENT_TIME, EXTRACT, DATE_TRUNC, TO_TIMESTAMP, TO_DATE, AGE, DATE_PART, INTERVAL, TIMEZONE, JUSTIFY_DAYS, JUSTIFY_HOURS |
| PostgreSQL (format conversion) | `TO_CHAR(timestamp_column, 'YYYY-MM-DD HH24:MI:SS')`; `TO_TIMESTAMP('2024-07-25 12:34:56', 'YYYY-MM-DD HH24:MI:SS')`; `TO_DATE('2024-07-25', 'YYYY-MM-DD')`; `TO_CHAR(timestamp_column, 'Month DD, YYYY')` |
| MySQL | NOW(), CURRENT_TIMESTAMP, CURDATE(), CURTIME(), SYSDATE(), UTC_TIMESTAMP(), DATE_FORMAT, STR_TO_DATE, TIMESTAMPADD, TIMESTAMPDIFF, ADDDATE, SUBDATE, DATE_ADD, DATE_SUB, EXTRACT, UNIX_TIMESTAMP, FROM_UNIXTIME |
| MySQL (format conversion) | `DATE_FORMAT(timestamp_column, '%Y-%m-%d %H:%i:%s')` formats a timestamp with the given format string; `STR_TO_DATE('2024/07/25 12:34:56', '%Y/%m/%d %H:%i:%s')` parses a string with a custom format; `DATE_FORMAT(timestamp_column, '%W, %M %e, %Y %h:%i %p')` produces output like 'Wednesday, July 25, 2024 12:34 PM' |
Data Quality Checks

| Category | Concept | Description | Example SQL Code | Sources |
|----------|---------|-------------|------------------|---------|
| Completeness | Missing Values | Ensure no critical fields are missing values; check for NULLs in critical columns. | `SELECT * FROM table WHERE critical_column IS NULL;` | Monte Carlo Data, Airbyte, ProjectPro |
| Uniqueness | Duplicate Records | Ensure records are unique; use DISTINCT or unique constraints. | `SELECT DISTINCT * FROM table;`<br>`ALTER TABLE table ADD CONSTRAINT unique_constraint UNIQUE(column);` | Monte Carlo Data, DataRobot, DvSum |
| Accuracy | Data Validity | Verify data conforms to the required format and range (e.g., email formats, numeric ranges). | `SELECT * FROM table WHERE email NOT LIKE '%@%.%';`<br>`SELECT * FROM table WHERE age < 0 OR age > 120;` | Acceldata, Databricks, IBM |
| Consistency | Referential Integrity | Ensure foreign keys match the corresponding primary keys via constraints. | `ALTER TABLE child_table ADD CONSTRAINT fk_name FOREIGN KEY (foreign_key_column) REFERENCES parent_table(primary_key_column);` | Monte Carlo Data, Datafold |
| Timeliness | Data Freshness | Ensure data is up to date via timestamp checks. | `SELECT * FROM table WHERE last_updated > NOW() - INTERVAL 1 DAY;` | Acceldata, ProjectPro |
| Validity | Schema Validation | Ensure data conforms to the predefined schema: validate types and field lengths. | `SELECT * FROM table WHERE LENGTH(column) > 255;`<br>`SELECT * FROM table WHERE column NOT LIKE 'valid_format';` | Databricks, Datafold |
| Integrity | Data Lineage | Track data flow and transformations from source to destination via metadata. | Not SQL-based; typically handled through ETL tools or metadata management systems. | Oxagile, IABAC |
| Volume | Data Volume | Ensure data volume matches expectations; check row counts before and after ETL. | `SELECT COUNT(*) FROM table;` | Monte Carlo Data, DvSum |
| Range Checks | Value Range | Ensure numeric values fall within expected ranges (e.g., ages 0-120). | `SELECT * FROM table WHERE age < 0 OR age > 120;` | KDnuggets, DataRobot |
| Pattern Matching | String Patterns | Ensure strings follow a predefined pattern (emails, phone numbers) using regular expressions. | `SELECT * FROM table WHERE phone_number NOT REGEXP '^[0-9]{10}$';` | Airbyte, ProjectPro |
| Distribution Checks | Data Distribution | Ensure data follows the expected statistical distribution. | Not SQL-based; typically done with statistical analysis tools. | Monte Carlo Data, DataRobot, KDnuggets |
| Anomaly Detection | Outlier Detection | Identify and handle outliers that deviate significantly from the norm. | `SELECT * FROM table WHERE sales < (SELECT AVG(sales) - 3 * STDDEV(sales) FROM table) OR sales > (SELECT AVG(sales) + 3 * STDDEV(sales) FROM table);` | DvSum, DataRobot |
| Schema Drift | Detecting Schema Changes | Monitor for unexpected schema changes by comparing current and historical schemas. | Not SQL-based; often done through schema-comparison tools. | Redpanda, Datafold, DataRobot |
| Volume Checks | Expected Data Volumes | Ensure volumes match historical trends (e.g., transactions per day within the expected range). | `SELECT COUNT(*) FROM transactions WHERE transaction_date = CURRENT_DATE;` | Monte Carlo Data, DvSum |
| Null Constraints | Ensuring Non-NULL Values | Ensure essential columns never contain NULLs via NOT NULL constraints. | `ALTER TABLE table MODIFY column_name datatype NOT NULL;` | Databricks, Monte Carlo Data |
| Automated Testing | CI/CD for Data Quality | Automatically test data quality in the pipeline using CI tools (e.g., CircleCI) on data-model changes. | Not SQL-based; involves CI/CD tools and scripts. | Datafold, IABAC |
| Data Profiling | Statistical Analysis | Analyze structure, content, and relationships to understand data quality. | Not SQL-based; typically done through profiling tools. | Airbyte, ProjectPro |
| Data Cleansing | Data Cleaning | Remove or correct inaccurate records (e.g., with OpenRefine). | Not SQL-based; often done through data-cleaning tools. | Airbyte, ProjectPro |
| Monitoring and Alerts | Real-Time Data Monitoring | Continuously monitor data quality and alert on anomalies (e.g., Apache Griffin). | Not SQL-based; involves monitoring tools and alert configurations. | Redpanda, DvSum |
| Domain Constraints | Domain-Specific Constraints | Ensure values fall within an allowable domain (e.g., product categories in e-commerce). | `SELECT * FROM table WHERE product_category IN ('Electronics', 'Clothing', 'Books');` | Monte Carlo Data, Datafold |
| Compliance Checks | Regulatory Compliance | Ensure data complies with regulations (e.g., GDPR, CCPA); anonymize personal data. | Not SQL-based; involves anonymization processes. | IABAC, Oxagile |
| Granularity and Relevance | Appropriate Data Granularity | Match granularity to the intended use; avoid excessive or insufficient detail. | Not SQL-based; often done through ETL and transformation processes. | IBM, Acceldata |
| Version Control | Tracking Changes to Data | Maintain versions of schemas and transformations to track changes. | Not SQL-based; involves version-control systems for schemas. | Datafold, IABAC |
| Impact Analysis | Analyzing Impact of Data Changes | Assess how data-model changes affect existing data and pipelines. | Not SQL-based; involves impact-analysis tools and processes. | DvSum, Datafold |
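
The same checks translate naturally to pandas when the data lives in DataFrames; a sketch of a few of them (completeness, uniqueness, validity, range) on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", None, "bad-email", "b@y.org"],
    "age": [34, 17, -5, 200],
})

checks = {
    # Completeness: NULLs in critical columns.
    "missing_email": df["email"].isna().sum(),
    # Uniqueness: duplicate rows.
    "duplicates": df.duplicated().sum(),
    # Accuracy: simple email-shape validity.
    "bad_emails": (~df["email"].str.contains(r"@.+\.", na=True)).sum(),
    # Range check: ages within 0-120.
    "out_of_range_age": ((df["age"] < 0) | (df["age"] > 120)).sum(),
}
for name, count in checks.items():
    print(f"{name}: {count}")
```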
String Functions

| Function | Description | Example Usage | Result |
|----------|-------------|---------------|--------|
| CONCAT() | Concatenates two or more strings. | `SELECT CONCAT('Hello', ' ', 'World');` | Hello World |
| SUBSTRING() | Extracts a substring from a string. | `SELECT SUBSTRING('Hello World', 1, 5);` | Hello |
| LENGTH() | Returns the length of a string. | `SELECT LENGTH('Hello');` | 5 |
| UPPER() | Converts a string to uppercase. | `SELECT UPPER('hello');` | HELLO |
| LOWER() | Converts a string to lowercase. | `SELECT LOWER('HELLO');` | hello |
| TRIM() | Removes leading and trailing spaces. | `SELECT TRIM(' Hello ');` | Hello |
| REPLACE() | Replaces all occurrences of a substring. | `SELECT REPLACE('Hello World', 'World', 'SQL');` | Hello SQL |
| CHARINDEX() | Returns the position of the first occurrence of a substring. | `SELECT CHARINDEX('World', 'Hello World');` | 7 |
| RIGHT() | Extracts the rightmost characters of a given length. | `SELECT RIGHT('Hello World', 5);` | World |
| LEFT() | Extracts the leftmost characters of a given length. | `SELECT LEFT('Hello World', 5);` | Hello |
| POSITION() | Returns the position of the first occurrence of a substring. | `SELECT POSITION('World' IN 'Hello World');` | 7 |
| LTRIM() | Removes leading spaces. | `SELECT LTRIM(' Hello');` | Hello |
| RTRIM() | Removes trailing spaces. | `SELECT RTRIM('Hello ');` | Hello |
| REVERSE() | Reverses the characters in a string. | `SELECT REVERSE('Hello');` | olleH |
| FORMAT() | Formats a number or date according to a format string. | `SELECT FORMAT(1234567.89, '###,###.##');` | 1,234,567.89 |
Number Functions

| Function | Description | Example Usage | Result |
|----------|-------------|---------------|--------|
| ROUND() | Rounds a number to the specified number of decimals. | `SELECT ROUND(123.4567, 2);` | 123.46 |
| FLOOR() | Rounds a number down to the nearest integer. | `SELECT FLOOR(123.4567);` | 123 |
| CEILING() | Rounds a number up to the nearest integer. | `SELECT CEILING(123.4567);` | 124 |
| ABS() | Returns the absolute value of a number. | `SELECT ABS(-123.456);` | 123.456 |
| RAND() | Returns a random float between 0 and 1. | `SELECT RAND();` | 0.1234 (varies) |
| PI() | Returns the value of π (pi). | `SELECT PI();` | 3.14159 |
| POWER() | Raises a number to the power of another number. | `SELECT POWER(2, 3);` | 8 |
| SQRT() | Returns the square root of a number. | `SELECT SQRT(16);` | 4 |
| LOG() | Returns the natural logarithm of a number. | `SELECT LOG(10);` | 2.3026 |
| EXP() | Returns e raised to the power of a number. | `SELECT EXP(1);` | 2.7183 |
| CAST() | Converts a value from one data type to another. | `SELECT CAST('123' AS INT);` | 123 |
| CONVERT() | Converts a value from one data type to another. | `SELECT CONVERT(VARCHAR, 123);` | 123 |
| SUM() | Returns the sum of a numeric column. | `SELECT SUM(salary) FROM employees;` | 50000 |
| AVG() | Returns the average of a numeric column. | `SELECT AVG(salary) FROM employees;` | 5000 |
| MIN() | Returns the minimum value of a numeric column. | `SELECT MIN(salary) FROM employees;` | 3000 |
| MAX() | Returns the maximum value of a numeric column. | `SELECT MAX(salary) FROM employees;` | 8000 |
Useful Keywords

| Keyword | Description | Example Usage | Result |
|---------|-------------|---------------|--------|
| CASE | Conditional logic in SQL queries. | `SELECT CASE WHEN age > 18 THEN 'Adult' ELSE 'Minor' END FROM users;` | Adult or Minor |
| COALESCE() | Returns the first non-null value in a list of arguments. | `SELECT COALESCE(NULL, NULL, 'Hello', 'World');` | Hello |
| NULLIF() | Returns NULL if two expressions are equal. | `SELECT NULLIF(10, 10);` | NULL |
| ISNULL() | Replaces NULL with a specified replacement value. | `SELECT ISNULL(NULL, 'Default');` | Default |
| GREATEST() | Returns the greatest value among the arguments. | `SELECT GREATEST(1, 2, 3);` | 3 |
| LEAST() | Returns the smallest value among the arguments. | `SELECT LEAST(1, 2, 3);` | 1 |
| DISTINCT | Removes duplicate values from the result set. | `SELECT DISTINCT column_name FROM table_name;` | Varies |
| LIMIT | Limits the number of rows returned. | `SELECT * FROM table_name LIMIT 10;` | First 10 rows |
| OFFSET | Skips rows before starting to return rows. | `SELECT * FROM table_name LIMIT 10 OFFSET 20;` | Rows 21-30 |
| JOIN | Combines rows from two or more tables on a related column. | `SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;` | Combined rows |
| GROUP BY | Groups rows with the same values into summary rows. | `SELECT department, COUNT(*) FROM employees GROUP BY department;` | Count per department |
| HAVING | Filters groups after GROUP BY. | `SELECT department, COUNT(*) FROM employees GROUP BY department HAVING COUNT(*) > 5;` | Groups with more than 5 employees |
| ORDER BY | Sorts the result set by one or more columns. | `SELECT * FROM table_name ORDER BY column_name ASC;` | Sorted rows |
| UNION | Combines the result sets of two or more SELECT statements. | `SELECT column_name FROM table1 UNION SELECT column_name FROM table2;` | Combined result sets |
Regex Patterns

| Pattern | Description | Example Usage | Result |
|---------|-------------|---------------|--------|
| `.` | Matches any single character except newline. | `SELECT REGEXP_MATCH('Hello', '.');` | H, e, l, l, o |
| `^` | Matches the start of a string. | `SELECT REGEXP_MATCH('Hello', '^H');` | H |
| `$` | Matches the end of a string. | `SELECT REGEXP_MATCH('Hello', 'o$');` | o |
| `*` | Matches 0 or more of the preceding element. | `SELECT REGEXP_MATCH('Hello', 'l*');` | ll |
| `+` | Matches 1 or more of the preceding element. | `SELECT REGEXP_MATCH('Hello', 'l+');` | ll |
| `?` | Matches 0 or 1 of the preceding element. | `SELECT REGEXP_MATCH('Hello', 'l?');` | l |
| `{n}` | Matches exactly n occurrences of the preceding element. | `SELECT REGEXP_MATCH('Hello', 'l{2}');` | ll |
| `{n,}` | Matches n or more occurrences of the preceding element. | `SELECT REGEXP_MATCH('Helloo', 'o{2,}');` | oo |
| `{n,m}` | Matches between n and m occurrences of the preceding element. | `SELECT REGEXP_MATCH('Hellooo', 'o{2,4}');` | ooo |
| `[]` | Matches any one of the characters inside the brackets. | `SELECT REGEXP_MATCH('Hello', '[Hh]ello');` | Hello |
| `[^ ]` | Matches any one character not inside the brackets. | `SELECT REGEXP_MATCH('Hello', '[^H]ello');` | ello |
| `\|` | Acts as an OR operator. | `SELECT REGEXP_MATCH('Hello', 'H\|h');` | H |
| `()` | Groups patterns together. | `SELECT REGEXP_MATCH('Hello', '(H\|h)ello');` | Hello |
| `\` | Escapes a special character. | `SELECT REGEXP_MATCH('Hello$', 'Hello\$');` | Hello$ |
| `\d` | Matches any digit (0-9). | `SELECT REGEXP_MATCH('123', '\d');` | 1, 2, 3 |
| `\D` | Matches any non-digit character. | `SELECT REGEXP_MATCH('abc', '\D');` | a, b, c |
| `\w` | Matches any word character (alphanumeric & underscore). | `SELECT REGEXP_MATCH('word_123', '\w');` | w, o, r, d, _, 1, 2, 3 |
| `\W` | Matches any non-word character. | `SELECT REGEXP_MATCH('word_123!', '\W');` | ! |
| `\s` | Matches any whitespace character. | `SELECT REGEXP_MATCH('hello world', '\s');` | the space |
| `\S` | Matches any non-whitespace character. | `SELECT REGEXP_MATCH('hello world', '\S');` | h, e, l, l, o, w, o, r, l, d |
More Regex Patterns and Use Cases

| Pattern | Description | Example Usage | Result |
|---------|-------------|---------------|--------|
| `^\d+$` | Matches strings that contain only digits. | `SELECT REGEXP_MATCH('12345', '^\d+$');` | 12345 |
| `\bword\b` | Matches 'word' as a whole word (word boundary). | `SELECT REGEXP_MATCH('word', '\bword\b');` | word |
| `(\d{2}-\d{2}-\d{4})` | Matches dates in 'DD-MM-YYYY' format. | `SELECT REGEXP_MATCH('12-01-2024', '(\d{2}-\d{2}-\d{4})');` | 12-01-2024 |
| `(\d+)\s(\w+)` | Matches a number, a space, then a word. | `SELECT REGEXP_MATCH('123 hello', '(\d+)\s(\w+)');` | 123 hello |
| `(?i)hello` | Case-insensitive match for 'hello'. | `SELECT REGEXP_MATCH('Hello', '(?i)hello');` | Hello |
| `^([a-z]+)\1$` | Matches strings with a repeated sequence of lowercase letters. | `SELECT REGEXP_MATCH('abab', '^([a-z]+)\1$');` | abab |
| `(?:abc\|def)\d+` | Matches 'abc' or 'def' followed by one or more digits. | `SELECT REGEXP_MATCH('def123', '(?:abc\|def)\d+');` | def123 |
| `(?<=\s)\d+` | Matches a number preceded by whitespace (positive lookbehind). | `SELECT REGEXP_MATCH('The price is 100', '(?<=\s)\d+');` | 100 |
| `(?=\d{2})\d+` | Matches a number followed by exactly two digits (positive lookahead). | `SELECT REGEXP_MATCH('12345', '(?=\d{2})\d+');` | 12345 |
| `\b(?:\d{3}-\d{2}-\d{4})\b` | Matches a social security number 'XXX-XX-XXXX' as a whole word. | `SELECT REGEXP_MATCH('123-45-6789', '\b(?:\d{3}-\d{2}-\d{4})\b');` | 123-45-6789 |
Regex for Preprocessing Challenges

| Challenge | Example Pattern | Description | SQL Example |
|-----------|-----------------|-------------|-------------|
| Extracting email addresses | `[\w\.-]+@[\w\.-]+\.\w+` | Matches standard email addresses. | `SELECT REGEXP_MATCH(text_column, '[\w\.-]+@[\w\.-]+\.\w+');` |
| Removing non-alphanumeric characters | `[^\w\s]` | Removes punctuation and special characters. | `SELECT REGEXP_REPLACE(text_column, '[^\w\s]', '');` |
| Validating phone numbers in various formats | `\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}` | Matches phone numbers with different separators. | `SELECT REGEXP_MATCH(phone_column, '\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}');` |
| Extracting dates in 'YYYY-MM-DD' format | `\d{4}-\d{2}-\d{2}` | Matches dates in 'YYYY-MM-DD' format. | `SELECT REGEXP_MATCH(date_column, '\d{4}-\d{2}-\d{2}');` |
| Splitting full names into first and last names | `(\w+)\s(\w+)` | Matches and splits full names into first and last names. | `SELECT REGEXP_MATCH(full_name_column, '(\w+)\s(\w+)');` |
| Extracting IP addresses | `\b(?:\d{1,3}\.){3}\d{1,3}\b` | Matches IPv4 addresses. | `SELECT REGEXP_MATCH(ip_column, '\b(?:\d{1,3}\.){3}\d{1,3}\b');` |
| Identifying URLs | `https?://[^\s/$.?#].[^\s]*` | Matches URLs starting with http or https. | `SELECT REGEXP_MATCH(url_column, 'https?://[^\s/$.?#].[^\s]*');` |
| Validating credit card numbers | `\b(?:\d[ -]*?){13,16}\b` | Matches credit card numbers with 13 to 16 digits. | `SELECT REGEXP_MATCH(cc_column, '\b(?:\d[ -]*?){13,16}\b');` |
| Matching hashtags | `#\w+` | Matches hashtags starting with #. | `SELECT REGEXP_MATCH(text_column, '#\w+');` |
| Extracting monetary amounts | `\$\d+(?:,\d{3})*(?:\.\d{2})?` | Matches monetary amounts with optional cents. | `SELECT REGEXP_MATCH(amount_column, '\$\d+(?:,\d{3})*(?:\.\d{2})?');` |
| Finding repeated words | `\b(\w+)\s+\1\b` | Matches repeated words. | `SELECT REGEXP_MATCH(text_column, '\b(\w+)\s+\1\b');` |
| Identifying hexadecimal color codes | `#(?:[0-9a-fA-F]{3}){1,2}` | Matches hexadecimal color codes. | `SELECT REGEXP_MATCH(color_column, '#(?:[0-9a-fA-F]{3}){1,2}');` |
| Extracting domain names from URLs | `https?://(?:www\.)?([^/\s]+)` | Extracts domain names from URLs. | `SELECT REGEXP_MATCH(url_column, 'https?://(?:www\.)?([^/\s]+)');` |
| Matching US ZIP codes | `\b\d{5}(?:-\d{4})?\b` | Matches standard and extended US ZIP codes. | `SELECT REGEXP_MATCH(zip_column, '\b\d{5}(?:-\d{4})?\b');` |
| Extracting file extensions | `\.\w+$` | Matches file extensions. | `SELECT REGEXP_MATCH(file_column, '\.\w+$');` |
| Validating URLs with specific TLDs | `https?://[^\s/$.?#].[^\s]*\.(com\|org\|net)` | Matches URLs ending in a specific top-level domain. | `SELECT REGEXP_MATCH(url_column, 'https?://[^\s/$.?#].[^\s]*\.(com\|org\|net)');` |
| Removing HTML tags | `<[^>]+>` | Removes HTML tags from text. | `SELECT REGEXP_REPLACE(html_column, '<[^>]+>', '');` |
| Matching dates in 'MM/DD/YYYY' format | `(?:0[1-9]\|1[0-2])/(?:0[1-9]\|[12][0-9]\|3[01])/\d{4}` | Matches dates in 'MM/DD/YYYY' format. | `SELECT REGEXP_MATCH(date_column, '(?:0[1-9]\|1[0-2])/(?:0[1-9]\|[12][0-9]\|3[01])/\d{4}');` |
| Finding IP addresses in text | `(?:\d{1,3}\.){3}\d{1,3}` | Matches any IP address. | `SELECT REGEXP_MATCH(text_column, '(?:\d{1,3}\.){3}\d{1,3}');` |
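
The same patterns work directly in Python's re module, which is handy for testing them before embedding them in SQL; a quick check of three patterns from the tables above on a made-up string:

```python
import re

text = "Contact a@x.com or b@y.org; invoice total $1,234.56 on 2024-07-25."

emails = re.findall(r"[\w.-]+@[\w.-]+\.\w+", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
amounts = re.findall(r"\$\d+(?:,\d{3})*(?:\.\d{2})?", text)

print(emails)   # ['a@x.com', 'b@y.org']
print(dates)    # ['2024-07-25']
print(amounts)  # ['$1,234.56']
```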
SQL Order of Execution

| Stage | Description | Example Query Component | References |
|-------|-------------|-------------------------|------------|
| 1. FROM | Determines the tables and joins used in the query. | `FROM table1 INNER JOIN table2 ON table1.id = table2.id` | MySQL Documentation: FROM |
| 2. ON | Applies the conditions for the join operation. | `ON table1.id = table2.id` | MySQL Documentation: JOIN |
| 3. JOIN | Processes the joins as specified in the FROM clause. | `INNER JOIN table2` | MySQL Documentation: JOIN |
| 4. WHERE | Filters rows based on specified conditions. | `WHERE table1.value > 100` | MySQL Documentation: WHERE |
| 5. GROUP BY | Groups rows based on specified columns. | `GROUP BY table1.category` | MySQL Documentation: GROUP BY |
| 6. HAVING | Filters groups based on specified conditions. | `HAVING COUNT(*) > 5` | MySQL Documentation: HAVING |
| 7. SELECT | Selects the columns or expressions to be returned. | `SELECT table1.id, COUNT(table2.id)` | MySQL Documentation: SELECT |
| 8. DISTINCT | Removes duplicate rows from the result set. | `SELECT DISTINCT column_name` | MySQL Documentation: DISTINCT |
| 9. ORDER BY | Sorts the result set based on specified columns. | `ORDER BY table1.value DESC` | MySQL Documentation: ORDER BY |
| 10. LIMIT | Limits the number of rows returned. | `LIMIT 10` | MySQL Documentation: LIMIT |
| 11. OFFSET | Skips a specified number of rows before returning rows (used with LIMIT). | `LIMIT 10 OFFSET 20` | MySQL Documentation: LIMIT |
| 12. UNION | Combines results from multiple SELECT statements (if used). | `SELECT column_name FROM table1 UNION SELECT column_name FROM table2` | MySQL Documentation: UNION |

Mnemonic: "Friendly Jellyfish Will Gather Happy Sea Dolphins Over Lagoons" — FROM, JOIN, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, LIMIT.
Problem Solving Tips
Condition Tip
Each keyword Put that column-related table in the LEFT JOIN
Nth highest salary Use DENSE_RANK() to avoid gaps; alternatives include using two MAX() functions, two LIMIT clauses, or DENSE_RANK() with CTE
Null based salary Use IFNULL(DISTINCT SELECT CONDITION, NULL) AS destination_column_name
Consecutive numbers 1) Use EQUI JOIN with (current_table.field = duplicate_table1.field + 1) and (current_table.field = duplicate_table2.field + 2)
2) Use RANK() and SUM() with a check that the sum equals exactly 3; Group by the interested field
Parent/Child relationship EQUI JOIN based on associated columns and filter using conditions in the JOIN itself
Finding duplicates Group by key and use HAVING > some_value
Alternatively, use ROW_NUMBER() on key and filter RN > 1
Find results from Base Table not present Use JOIN results where not in (entire other table results)
Alternatively, use CTE results and then JOIN results
Find results from other table Left anti join to get results from the first table; use LEFT JOIN where the other_table.join_column is NULL (e.g., in PySpark)
Find results from based table using condition Use JOIN results where IN (condition-based results)
For example, find max top salaried employees
Target column (compare) previous all values Use EQUI JOIN (WHERE) with DATE_DIFF() and conditions
Alternatively, use INNER JOIN (ON) with DATE_DIFF() and conditions
First logged date Use MIN(date)
Sum of all previous until current column Use SUM(target_count_column) OVER (PARTITION BY column ORDER BY column ASC/DESC) AS target_count_until_current_row
Alternatively, self-join with an inequality on the date column (e.g., t1.event_date > t2.event_date) and aggregate
FILTER CONDITION (when NULLs must appear in the result) Use WHERE column IS NULL so rows with NULL values are included in the output
ORDER BY CTE (2 derived columns from it) Order by the derived columns (e.g., ORDER BY derived_column1, derived_column2)
HAVING Use HAVING to filter on grouped or aggregated columns
IFNULL(expression, value_if_null) Use IFNULL(expression, value_if_null) to handle null values in results
expression IS NULL Use SELECT employee_id, name FROM employees WHERE bonus IS NULL;
Indexes Use indexes on columns related to JOIN, FILTER, and GROUP BY operations
Avoid full table scans Ensure filters hit indexed columns (verify with EXPLAIN) so the optimizer does not scan the whole table
Simplify queries Use CTEs to simplify complex queries
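For the Nth-highest-salary tip above, here is a minimal runnable sketch using Python's built-in sqlite3 and a hypothetical employees table; DENSE_RANK() needs SQLite 3.25+ (bundled with recent CPython builds), and the IFNULL-style NULL fallback is handled on the Python side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ann', 90000), (2, 'Bob', 120000), (3, 'Cam', 120000), (4, 'Dee', 80000);
""")

n = 2  # target rank, illustrative
row = conn.execute("""
    WITH ranked AS (
        SELECT DISTINCT salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    )
    SELECT salary FROM ranked WHERE rnk = ?
""", (n,)).fetchone()

# IFNULL-style fallback done in Python: None when no nth salary exists.
print(row[0] if row else None)  # 90000 (ties at 120000 share rank 1)
```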
DB Query Comparison
Concept MySQL Oracle SQL PostgreSQL
Basic SQL Syntax SELECT, INSERT, UPDATE, DELETE, FROM, WHERE, GROUP BY, HAVING, ORDER BY SELECT, INSERT, UPDATE, DELETE, FROM, WHERE, GROUP BY, HAVING, ORDER BY SELECT, INSERT, UPDATE, DELETE, FROM, WHERE, GROUP BY, HAVING, ORDER BY
Data Types INT, VARCHAR, DATE, TEXT, BLOB, BOOLEAN NUMBER, VARCHAR2, DATE, CLOB, BLOB, BOOLEAN INTEGER, VARCHAR, DATE, TEXT, BYTEA, BOOLEAN
Creating Tables CREATE TABLE table_name (column1 datatype, column2 datatype, ...); CREATE TABLE table_name (column1 datatype, column2 datatype, ...); CREATE TABLE table_name (column1 datatype, column2 datatype, ...);
Inserting Data INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...); INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...); INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);
Updating Data UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition; UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition; UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition;
Deleting Data DELETE FROM table_name WHERE condition; DELETE FROM table_name WHERE condition; DELETE FROM table_name WHERE condition;
Querying Data SELECT column1, column2, ... FROM table_name WHERE condition; SELECT column1, column2, ... FROM table_name WHERE condition; SELECT column1, column2, ... FROM table_name WHERE condition;
Joins INNER JOIN, LEFT JOIN, RIGHT JOIN, CROSS JOIN (no native FULL OUTER JOIN; emulate with a UNION of LEFT and RIGHT joins) INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN, CROSS JOIN INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN, CROSS JOIN, NATURAL JOIN
Subqueries SELECT * FROM (SELECT column1, column2, ... FROM table_name WHERE condition) AS sub; (derived tables require an alias) SELECT * FROM (SELECT column1, column2, ... FROM table_name WHERE condition); SELECT * FROM (SELECT column1, column2, ... FROM table_name WHERE condition) AS sub; (derived tables require an alias)
Indexes CREATE INDEX index_name ON table_name (column1, column2, ...); CREATE INDEX index_name ON table_name (column1, column2, ...); CREATE INDEX index_name ON table_name (column1, column2, ...);
Views CREATE VIEW view_name AS SELECT column1, column2, ... FROM table_name WHERE condition; CREATE VIEW view_name AS SELECT column1, column2, ... FROM table_name WHERE condition; CREATE VIEW view_name AS SELECT column1, column2, ... FROM table_name WHERE condition;
Transactions START TRANSACTION; ... COMMIT; or ROLLBACK; Transactions begin implicitly (optionally SET TRANSACTION); ... COMMIT; or ROLLBACK; BEGIN; ... COMMIT; or ROLLBACK;
Stored Procedures CREATE PROCEDURE procedure_name() BEGIN ... END; CREATE OR REPLACE PROCEDURE procedure_name AS BEGIN ... END; CREATE OR REPLACE PROCEDURE procedure_name AS $$ BEGIN ... END $$ LANGUAGE plpgsql;
Functions CREATE FUNCTION function_name (parameters) RETURNS return_datatype BEGIN ... END; CREATE OR REPLACE FUNCTION function_name (parameters) RETURN return_datatype AS BEGIN ... END; CREATE OR REPLACE FUNCTION function_name (parameters) RETURNS return_datatype AS $$ BEGIN ... END $$ LANGUAGE plpgsql;
Triggers CREATE TRIGGER trigger_name BEFORE/AFTER INSERT/UPDATE/DELETE ON table_name FOR EACH ROW ...; CREATE OR REPLACE TRIGGER trigger_name BEFORE/AFTER INSERT/UPDATE/DELETE ON table_name FOR EACH ROW ...; CREATE TRIGGER trigger_name BEFORE/AFTER INSERT/UPDATE/DELETE ON table_name FOR EACH ROW ...;
Constraints PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, DEFAULT PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, EXCLUDE
Backup and Recovery mysqldump, mysqlhotcopy, InnoDB backup RMAN, Data Pump, Flashback pg_dump, pg_basebackup, Continuous Archiving
Performance Tuning EXPLAIN, ANALYZE, INDEX, Query optimization Optimizer hints, SQL Tuning Advisor, Indexing EXPLAIN, ANALYZE, VACUUM, Indexing
Security Management GRANT, REVOKE, CREATE USER, ALTER USER, PASSWORD POLICIES DBA_ROLES, DBA_USERS, VPD (Virtual Private Database) GRANT, REVOKE, CREATE ROLE, ALTER ROLE, Row Level Security (RLS)
Partitioning RANGE, LIST, HASH partitioning RANGE, LIST, HASH, INTERVAL partitioning RANGE, LIST, HASH partitioning
Replication Master-slave replication, Group replication Data Guard, GoldenGate Streaming replication, Logical replication
Sharding Manual sharding Automatic sharding Manual sharding
Data Warehousing MySQL data warehousing solutions Oracle Exadata, Oracle Autonomous Data Warehouse PostgreSQL data warehousing solutions
Data Lakes MySQL data lakes integration Oracle Big Data Appliance PostgreSQL data lakes integration
SQL Recursive Queries
Concept Description MySQL Implementation Oracle SQL Implementation PostgreSQL Implementation Use Cases Sources
Recursive Common Table Expressions (CTEs) Recursive CTEs are used to handle hierarchical data and perform recursive queries. WITH RECURSIVE cte_name AS ( SELECT ... UNION ALL SELECT ... ) SELECT * FROM cte_name; WITH cte_name AS ( SELECT ... UNION ALL SELECT ... ) SELECT * FROM cte_name; WITH RECURSIVE cte_name AS ( SELECT ... UNION ALL SELECT ... ) SELECT * FROM cte_name; Hierarchical data, organizational charts, directory structures HackerEarth, DataCamp, KDnuggets, FreeCodeCamp
Hierarchical Queries Specialized queries to manage hierarchical data, often using CONNECT BY. Not natively supported; typically use recursive CTEs or application-level logic. SELECT ... CONNECT BY PRIOR ... Achieved using recursive CTEs similar to MySQL. Tree structures, bill of materials, family trees HackerEarth, DataCamp, KDnuggets, FreeCodeCamp
Managing Recursive Queries Techniques to manage and optimize recursive queries, such as limiting recursion depth. WITH RECURSIVE cte_name AS ( SELECT ... UNION ALL SELECT ... WHERE ... ) SELECT * FROM cte_name; SELECT ... CONNECT BY PRIOR ... AND LEVEL <= n; WITH RECURSIVE cte_name AS ( SELECT ... UNION ALL SELECT ... WHERE ... ) SELECT * FROM cte_name; Performance optimization, preventing infinite loops HackerEarth, DataCamp, KDnuggets, FreeCodeCamp
Practical Examples and Use Cases Real-world examples of recursive queries, such as finding all subordinates in an organizational hierarchy. WITH RECURSIVE EmployeeHierarchy AS ( SELECT id, name, manager_id, 1 AS level FROM employees WHERE manager_id IS NULL UNION ALL SELECT e.id, e.name, e.manager_id, h.level + 1 FROM employees e INNER JOIN EmployeeHierarchy h ON e.manager_id = h.id ) SELECT * FROM EmployeeHierarchy; SELECT id, name, manager_id FROM employees CONNECT BY PRIOR id = manager_id; WITH RECURSIVE EmployeeHierarchy AS ( SELECT id, name, manager_id, 1 AS level FROM employees WHERE manager_id IS NULL UNION ALL SELECT e.id, e.name, e.manager_id, h.level + 1 FROM employees e INNER JOIN EmployeeHierarchy h ON e.manager_id = h.id ) SELECT * FROM EmployeeHierarchy; Employee hierarchies, category hierarchies in e-commerce HackerEarth, DataCamp, KDnuggets, FreeCodeCamp
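For concreteness, the recursive-CTE example from the table runs essentially as-is on SQLite (which, like PostgreSQL and MySQL 8+, uses WITH RECURSIVE); this sketch uses Python's built-in sqlite3 with a hypothetical four-row employees table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Root', NULL), (2, 'Lead A', 1), (3, 'Lead B', 1), (4, 'IC', 2);
""")

for row in conn.execute("""
    WITH RECURSIVE EmployeeHierarchy AS (
        SELECT id, name, manager_id, 1 AS level
        FROM employees WHERE manager_id IS NULL          -- anchor member
        UNION ALL
        SELECT e.id, e.name, e.manager_id, h.level + 1
        FROM employees e
        JOIN EmployeeHierarchy h ON e.manager_id = h.id  -- recursive member
    )
    SELECT level, name FROM EmployeeHierarchy ORDER BY level
"""):
    print(row)  # (1, 'Root'), (2, 'Lead A'), (2, 'Lead B'), (3, 'IC')
```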

Cloud Knowledge

Projects
Batch_project
AWS_E2E_Batch_Pipeline_Reddit
Stream_project
AWS_E2E_Realtime_Pipeline_Yelp

Certifications

AWS Certification
AWS Certified Data Analytics
GCP Certification
GCP Professional Data Engineer

Blog posts

Streaming
AWS Kinesis + Firehose Hands-on
AWS Kinesis + Multi Lambda
Batch
Data Migration + MySQLDump
Uncompress S3 using Lambda
AWS Kinesis + Multi Lambda
GCP Cloud SQL + AWS DMS + AWS RDS
Programming
Python Essential Concepts

Leetcode Knowledge

Click item below to expand

Common Data Structures and Algorithms Patterns
Pattern Description Use Cases
Sliding Window Perform operations on a specific window size within a data structure, such as finding the longest subarray.
Usage: Use when you need to perform operations on a subset of a linear data structure (array, string) of fixed size.
Clue: Look for problems asking to find longest/shortest substring, subarray, or desired value in a linear structure.
Linear data structures (arrays, strings)
Two Pointers Use two pointers iterating through a data structure in tandem to find pairs or subarrays fulfilling conditions.
Usage: Use when you need to find pairs or subarrays that meet certain conditions in sorted arrays or linked lists.
Clue: Useful for problems involving comparisons between elements, pairs, triplets, or subarrays in sorted structures.
Sorted arrays or linked lists
Fast and Slow Pointers Use two pointers moving at different speeds to detect cycles in linked lists or arrays.
Usage: Use to detect cycles in linked lists or arrays, or find the middle of a linked list.
Clue: Look for cyclic linked list/array problems or when needing to find midpoints or check for palindromes.
Cyclic data structures
Merge Intervals Efficiently handle overlapping intervals by merging or finding overlaps.
Usage: Use for problems involving overlapping intervals that need merging or identifying.
Clue: Look for problems mentioning overlapping intervals or requiring exclusive intervals.
Intervals, scheduling
Cyclic Sort Sort an array containing numbers in a given range by swapping elements to their correct positions.
Usage: Use for sorting arrays with numbers in a given range or finding missing/duplicate numbers.
Clue: Look for problems involving sorted arrays with given ranges or finding specific numbers.
Sorted arrays, finding missing/duplicate numbers
In-place Reversal Reverse links between nodes of a linked list without using extra memory.
Usage: Use when needing to reverse links in a linked list without using extra memory.
Clue: Look for constraints requiring in-place operations on linked lists.
Linked lists
Tree BFS Traverse a tree level by level using a queue.
Usage: Use for traversing trees level by level.
Clue: Look for problems asking for level-order traversal or working with tree levels.
Tree structures
Tree DFS Traverse a tree using depth-first search, using recursion or a stack.
Usage: Use for depth-first traversal of trees, especially when needing to process nodes pre/in/post-order.
Clue: Look for tree traversal problems requiring in-order, pre-order, or post-order processing.
Tree structures
Two Heaps Use a Min Heap and Max Heap to keep track of elements and find the median efficiently.
Usage: Use for problems needing to find median elements or split data into two parts efficiently.
Clue: Look for problems mentioning finding smallest/largest/median elements.
Finding smallest/largest/median elements
Subsets Use BFS to handle permutations and combinations of a given set of elements.
Usage: Use for problems involving generating permutations and combinations of a set of elements.
Clue: Look for problems requiring combinations or permutations of elements.
Combinations, permutations
Modified Binary Search Efficiently search for an element in a sorted array, linked list, or matrix.
Usage: Use for searching elements in sorted data structures (arrays, lists, matrices).
Clue: Look for problems involving sorted data and requiring efficient searching.
Sorted data structures
Top K Elements Use a Heap to find the top/smallest/frequent 'K' elements from a set.
Usage: Use for finding the top/smallest/frequent 'K' elements in a set.
Clue: Look for problems asking for top/smallest/frequent 'K' elements.
Finding top/smallest/frequent 'K' elements
K-way Merge Use a Heap to efficiently merge 'K' sorted arrays or lists.
Usage: Use for merging 'K' sorted arrays or lists efficiently.
Clue: Look for problems involving merging sorted lists or arrays.
Merging sorted arrays/lists
Topological Sort Perform topological sorting on a directed graph with no cycles to find a linear ordering of elements.
Usage: Use for sorting elements with dependencies in a directed graph without cycles.
Clue: Look for problems dealing with graphs, dependencies, or requiring sorted order of elements with dependencies.
Graphs, dependency resolution
Sliding Window
Field Details
Description Perform operations on a specific window size within a data structure, such as finding the longest subarray.
Usage Use when you need to perform operations on a subset of a linear data structure (array, string) of fixed size.
Clue Look for problems asking to find longest/shortest substring, subarray, or desired value in a linear structure.
Use Cases Linear data structures (arrays, strings)
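A minimal Python sketch of the fixed-size variant (the function name and sample input are illustrative):

```python
def max_sum_window(nums, k):
    """Max sum over any contiguous window of size k (sliding window)."""
    window = sum(nums[:k])
    best = window
    for i in range(k, len(nums)):
        window += nums[i] - nums[i - k]  # slide: add the new element, drop the old
        best = max(best, window)
    return best

print(max_sum_window([2, 1, 5, 1, 3, 2], 3))  # 9 -> window [5, 1, 3]
```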
Two Pointers
Field Details
Description Use two pointers iterating through a data structure in tandem to find pairs or subarrays fulfilling conditions.
Usage Use when you need to find pairs or subarrays that meet certain conditions in sorted arrays or linked lists.
Clue Useful for problems involving comparisons between elements, pairs, triplets, or subarrays in sorted structures.
Use Cases Sorted arrays or linked lists
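An illustrative Python sketch for the classic pair-with-target-sum problem on a sorted array:

```python
def pair_with_sum(nums, target):
    """Return indices of two values in a sorted list that sum to target."""
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        s = nums[lo] + nums[hi]
        if s == target:
            return lo, hi
        if s < target:
            lo += 1   # need a bigger sum
        else:
            hi -= 1   # need a smaller sum
    return None

print(pair_with_sum([1, 2, 4, 6, 10], 8))  # (1, 3) -> 2 + 6
```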
Fast and Slow Pointers
Field Details
Description Use two pointers moving at different speeds to detect cycles in linked lists or arrays.
Usage Use to detect cycles in linked lists or arrays, or find the middle of a linked list.
Clue Look for cyclic linked list/array problems or when needing to find midpoints or check for palindromes.
Use Cases Cyclic data structures
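A sketch of Floyd's tortoise-and-hare cycle detection; the Node class here is a minimal stand-in:

```python
class Node:
    def __init__(self, val):
        self.val, self.next = val, None

def has_cycle(head):
    """Fast pointer moves two steps per iteration; they meet iff a cycle exists."""
    slow = fast = head
    while fast and fast.next:
        slow, fast = slow.next, fast.next.next
        if slow is fast:
            return True
    return False

a, b, c = Node(1), Node(2), Node(3)
a.next, b.next, c.next = b, c, b  # c points back to b -> cycle
print(has_cycle(a))  # True
```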
Merge Intervals
Field Details
Description Efficiently handle overlapping intervals by merging or finding overlaps.
Usage Use for problems involving overlapping intervals that need merging or identifying.
Clue Look for problems mentioning overlapping intervals or requiring exclusive intervals.
Use Cases Intervals, scheduling
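A minimal Python sketch, assuming intervals are [start, end] pairs:

```python
def merge_intervals(intervals):
    """Sort by start, then extend the last merged interval on overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlap: extend
        else:
            merged.append([start, end])
    return merged

print(merge_intervals([[1, 4], [2, 5], [7, 9]]))  # [[1, 5], [7, 9]]
```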
Cyclic Sort
Field Details
Description Sort an array containing numbers in a given range by swapping elements to their correct positions.
Usage Use for sorting arrays with numbers in a given range or finding missing/duplicate numbers.
Clue Look for problems involving sorted arrays with given ranges or finding specific numbers.
Use Cases Sorted arrays, finding missing/duplicate numbers
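A Python sketch, assuming the array holds the values 1..n with no gaps:

```python
def cyclic_sort(nums):
    """Sort values 1..n in place by swapping each value to index value - 1."""
    i = 0
    while i < len(nums):
        j = nums[i] - 1              # the correct slot for nums[i]
        if nums[i] != nums[j]:
            nums[i], nums[j] = nums[j], nums[i]
        else:
            i += 1
    return nums

print(cyclic_sort([3, 1, 5, 4, 2]))  # [1, 2, 3, 4, 5]
```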
In-place Reversal
Field Details
Description Reverse links between nodes of a linked list without using extra memory.
Usage Use when needing to reverse links in a linked list without using extra memory.
Clue Look for constraints requiring in-place operations on linked lists.
Use Cases Linked lists
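A Python sketch with a minimal stand-in Node class; only O(1) extra space is used:

```python
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.next = val, nxt

def reverse_list(head):
    """Reverse a singly linked list in place by re-pointing each node."""
    prev = None
    while head:
        head.next, prev, head = prev, head, head.next  # re-point, then advance
    return prev

node = reverse_list(Node(1, Node(2, Node(3))))
while node:
    print(node.val, end=' ')  # 3 2 1
    node = node.next
```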
Tree BFS
Field Details
Description Traverse a tree level by level using a queue.
Usage Use for traversing trees level by level.
Clue Look for problems asking for level-order traversal or working with tree levels.
Use Cases Tree structures
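A Python sketch of level-order traversal; the TreeNode class and sample tree are illustrative:

```python
from collections import deque

class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def level_order(root):
    """Collect node values level by level using a FIFO queue."""
    levels, queue = [], deque([root] if root else [])
    while queue:
        levels.append([])
        for _ in range(len(queue)):    # drain exactly one level
            node = queue.popleft()
            levels[-1].append(node.val)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
    return levels

root = TreeNode(1, TreeNode(2), TreeNode(3, TreeNode(4)))
print(level_order(root))  # [[1], [2, 3], [4]]
```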
Tree DFS
Field Details
Description Traverse a tree using depth-first search, using recursion or a stack.
Usage Use for depth-first traversal of trees, especially when needing to process nodes pre/in/post-order.
Clue Look for tree traversal problems requiring in-order, pre-order, or post-order processing.
Use Cases Tree structures
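A recursive in-order sketch in Python (pre-order and post-order just move the append):

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def inorder(node, out=None):
    """In-order DFS: left subtree, node, right subtree."""
    if out is None:
        out = []
    if node:
        inorder(node.left, out)
        out.append(node.val)
        inorder(node.right, out)
    return out

root = TreeNode(2, TreeNode(1), TreeNode(3))
print(inorder(root))  # [1, 2, 3]
```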
Two Heaps
Field Details
Description Use a Min Heap and Max Heap to keep track of elements and find the median efficiently.
Usage Use for problems needing to find median elements or split data into two parts efficiently.
Clue Look for problems mentioning finding smallest/largest/median elements.
Use Cases Finding smallest/largest/median elements
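A running-median sketch in Python using heapq; since heapq is a min-heap, the lower half stores negated values:

```python
import heapq

class MedianFinder:
    """Max-heap for the lower half, min-heap for the upper half."""
    def __init__(self):
        self.lo = []  # max-heap via negated values
        self.hi = []  # min-heap

    def add(self, num):
        heapq.heappush(self.lo, -num)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))   # rebalance
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

mf = MedianFinder()
for n in (5, 2, 8, 1):
    mf.add(n)
print(mf.median())  # 3.5 -> mean of 2 and 5
```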
Subsets
Field Details
Description Use BFS to handle permutations and combinations of a given set of elements.
Usage Use for problems involving generating permutations and combinations of a set of elements.
Clue Look for problems requiring combinations or permutations of elements.
Use Cases Combinations, permutations
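A Python sketch of the BFS-style expansion the description refers to: each pass clones every existing subset and extends it with the next element:

```python
def subsets(nums):
    """Generate all subsets by breadth-wise clone-and-extend."""
    result = [[]]
    for n in nums:
        result += [cur + [n] for cur in result]
    return result

print(subsets([1, 2, 3]))
# [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
```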
Modified Binary Search
Field Details
Description Efficiently search for an element in a sorted array, linked list, or matrix.
Usage Use for searching elements in sorted data structures (arrays, lists, matrices).
Clue Look for problems involving sorted data and requiring efficient searching.
Use Cases Sorted data structures
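One common variant as a Python sketch: binary search adapted to a rotated sorted array, deciding at each step which half is still sorted:

```python
def search_rotated(nums, target):
    """Binary search in a rotated sorted array; returns index or -1."""
    lo, hi = 0, len(nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if nums[mid] == target:
            return mid
        if nums[lo] <= nums[mid]:               # left half is sorted
            if nums[lo] <= target < nums[mid]:
                hi = mid - 1
            else:
                lo = mid + 1
        else:                                   # right half is sorted
            if nums[mid] < target <= nums[hi]:
                lo = mid + 1
            else:
                hi = mid - 1
    return -1

print(search_rotated([4, 5, 6, 7, 0, 1, 2], 0))  # 4
```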
Top K Elements
Field Details
Description Use a Heap to find the top/smallest/frequent 'K' elements from a set.
Usage Use for finding the top/smallest/frequent 'K' elements in a set.
Clue Look for problems asking for top/smallest/frequent 'K' elements.
Use Cases Finding top/smallest/frequent 'K' elements
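A Python sketch for the k-largest case: maintain a size-k min-heap so the smallest of the current top k sits at the root and is cheap to evict:

```python
import heapq

def top_k(nums, k):
    """Return the k largest values (unordered) in O(n log k)."""
    heap = nums[:k]
    heapq.heapify(heap)
    for n in nums[k:]:
        if n > heap[0]:
            heapq.heapreplace(heap, n)  # evict the current smallest
    return heap

print(sorted(top_k([3, 1, 5, 12, 2, 11], 3)))  # [5, 11, 12]
```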
K-way Merge
Field Details
Description Use a Heap to efficiently merge 'K' sorted arrays or lists.
Usage Use for merging 'K' sorted arrays or lists efficiently.
Clue Look for problems involving merging sorted lists or arrays.
Use Cases Merging sorted arrays/lists
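A Python sketch: the heap holds one (value, list index, element index) entry per list, so each pop yields the global minimum. The standard library also offers heapq.merge for exactly this job:

```python
import heapq

def k_way_merge(lists):
    """Merge k sorted lists using a heap of head elements."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(lists[i]):                      # advance within list i
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out

print(k_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))
# [1, 2, 3, 4, 5, 6, 7, 8]
```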
Topological Sort
Field Details
Description Perform topological sorting on a directed graph with no cycles to find a linear ordering of elements.
Usage Use for sorting elements with dependencies in a directed graph without cycles.
Clue Look for problems dealing with graphs, dependencies, or requiring sorted order of elements with dependencies.
Use Cases Graphs, dependency resolution
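A Python sketch of Kahn's algorithm (one standard way to implement this pattern); nodes are numbered 0..n-1 and an edge (u, v) means u must come before v:

```python
from collections import deque

def topo_sort(num_nodes, edges):
    """Kahn's algorithm: repeatedly emit nodes whose in-degree is zero."""
    graph = [[] for _ in range(num_nodes)]
    indegree = [0] * num_nodes
    for u, v in edges:
        graph[u].append(v)
        indegree[v] += 1
    queue = deque(i for i in range(num_nodes) if indegree[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order if len(order) == num_nodes else None  # None -> cycle detected

print(topo_sort(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # [0, 1, 2, 3]
```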

Please note: this content is my own work (with some AI assistance). Please do not reproduce or copy any part of this document without crediting this GitHub profile. Thank you for respecting intellectual property rights.
