😀 Hello there!!
- LeetCode Stats
- Profile views
- LinkedIn
- Blog: @prasku
- School: Indiana University Bloomington
- Currently Playing (Spotify Profile)
- Python IDE
- SQL IDE

Python Knowledge


Core Python Skills

| Category | Skills |
|----------|--------|
| Syntax and Semantics | Python's syntax, data types, operators, and control flow (loops, conditionals) |
| Functions | Writing and using functions; lambda expressions; decorators; higher-order functions |
| Modules and Packages | Creating, importing, and managing Python modules and packages |
| Object-Oriented Programming (OOP) | Classes, inheritance, polymorphism, encapsulation; designing robust class hierarchies |
| Error Handling | Exception handling with try/except/finally; custom exceptions |
| Iterators and Generators | Understanding iterators and generators; use cases for memory-efficient data processing |
| Comprehensions | List, dictionary, and set comprehensions |
| Context Managers | Using with statements; creating custom context managers for resource management |
| Decorators and Metaprogramming | Using and creating decorators; understanding metaclasses for advanced programming |
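
A quick sketch tying a few of these together (a decorator, a generator, and a custom context manager); the names here are illustrative, not from any particular project:

```python
import time
from contextlib import contextmanager

def timed(func):
    """Decorator: report how long the wrapped function takes."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@contextmanager
def open_resource(name):
    """Custom context manager: acquire/release around the with-block."""
    print(f"acquiring {name}")
    try:
        yield name
    finally:
        print(f"releasing {name}")

def squares(n):
    """Generator: yields values lazily instead of building a list."""
    for i in range(n):
        yield i * i

@timed
def total(n):
    return sum(squares(n))  # memory-efficient: no intermediate list

with open_resource("db-connection"):
    print(total(1_000_000))
```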
PySpark Topics

| Category | Topics |
|----------|--------|
| Introduction to PySpark | Basics of Spark, Spark architecture, setting up the PySpark environment |
| RDD (Resilient Distributed Dataset) | RDD operations (transformations and actions), creating RDDs, RDD lineage, fault tolerance |
| DataFrames | Creating DataFrames, schema management, data manipulation, SQL operations on DataFrames |
| Spark SQL | SQL functions, running SQL queries, DataFrame API vs. SQL API |
| PySpark Functions | User-defined functions (UDFs), built-in functions (col, lit, when) |
| Data Processing | Aggregations, joins, filtering, sorting, groupBy operations, handling duplicates |
| Performance Optimization | Lazy evaluation, caching/persisting, partitioning, broadcast variables |
| Machine Learning | Using MLlib for machine learning tasks, creating pipelines, model training and evaluation |
| Data Handling | Handling missing values (mean/median imputation), dealing with skewed data, handling duplicates |
| Advanced Operations | Window functions, working with large datasets, data streaming |
| Error Handling & Debugging | Common errors in PySpark, debugging PySpark jobs, logging |
| Cluster Management | Different cluster managers (YARN, Mesos, Kubernetes, Standalone), resource allocation |
| Data Serialization | Efficient serialization methods for performance improvement |
| Graph Processing | Basics of GraphFrames and GraphX for graph processing |
| Integration with Other Tools | Working with Hadoop, Hive, and Kafka for data ingestion and processing |
| PySpark SQL | Working with PySpark SQL, SparkSession, SparkContext, DataFrames |
| Data Formats | Reading and writing data in formats like CSV, Parquet, JSON |
| Algorithms | Common algorithms for clustering, classification, regression, and recommendation using MLlib |
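
A minimal starter showing several of these topics in one place (SparkSession, DataFrame creation, built-in functions, and an aggregation); the column values are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, avg

spark = SparkSession.builder.appName("starter").getOrCreate()

# Create a small DataFrame in place of a real data source.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 17), ("carol", 52)],
    ["name", "age"],
)

# Built-in functions + a derived column.
labeled = df.withColumn("group", when(col("age") >= 18, "adult").otherwise("minor"))

# Aggregation with groupBy; Spark evaluates lazily until show()/collect().
labeled.groupBy("group").agg(avg("age").alias("avg_age")).show()
```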
PySpark Methods

| Category | Methods |
|----------|---------|
| RDD Operations | map(), flatMap(), filter(), mapPartitions(), reduce(), reduceByKey(), groupByKey(), sortByKey(), join(), cogroup(), distinct(), union(), intersection(), subtract(), zipWithIndex(), cache(), persist(), countByValue(), take(), takeSample(), sample(), collect(), first(), foreach(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile() |
| DataFrame Creation | createDataFrame(), spark.read.csv(), spark.read.json(), spark.read.parquet(), spark.read.orc(), spark.read.format("avro").load() |
| DataFrame Operations | select(), filter(), where(), drop(), withColumn(), withColumnRenamed(), alias(), distinct(), orderBy(), sort(), groupBy(), agg(), count(), sum(), avg(), max(), min(), collect(), head(), show(), describe(), printSchema(), dropDuplicates(), limit(), union(), intersect(), exceptAll(), join(), crossJoin(), coalesce(), repartition(), cache(), persist(), toPandas() |
| SQL Functions | col(), lit(), when(), expr(), regexp_replace(), concat(), substr(), length(), lower(), upper(), trim(), ltrim(), rtrim(), split(), date_format(), to_date(), year(), month(), dayofmonth(), hour(), minute(), second(), datediff(), current_date(), current_timestamp(), lag(), lead(), rank(), dense_rank(), row_number() |
| Aggregation Functions | sum(), avg(), min(), max(), count(), agg(), countDistinct(), approxQuantile(), pivot(), cube(), rollup() |
| Window Functions | window(), Window.partitionBy(), Window.orderBy(), rowsBetween(), rangeBetween(), rank(), dense_rank(), row_number(), lag(), lead() |
| DataFrame I/O | read, write, save(), load(), csv(), json(), parquet(), orc(), format("avro"), jdbc(), saveAsTable()/enableHiveSupport() |
| Performance Optimization | cache(), persist(), unpersist(), repartition(), coalesce(), broadcast(), checkpoint(), explain(), explain(mode=...) |
| Machine Learning (MLlib) | fit(), transform(), predict(), evaluate(), CrossValidator(), ParamGridBuilder(), Pipeline(), VectorAssembler(), StringIndexer(), OneHotEncoder(), RandomForestClassifier(), LogisticRegression(), DecisionTreeClassifier(), KMeans(), ALS(), PCA(), LDA(), NaiveBayes(), GBTClassifier()/GBTRegressor(), LinearRegression(), BinaryClassificationEvaluator()/MulticlassClassificationEvaluator(), RegressionEvaluator() |
| Data Handling | fillna(), dropna(), replace(), dropDuplicates(), na.fill(), na.drop(), na.replace() |
| Graph Processing | GraphFrame(), pageRank(), shortestPaths(), connectedComponents(), triangleCount(), labelPropagation() |
| Error Handling & Debugging | try/except, logging via log4j (logger.info/warn/error), sc.setLogLevel() |
| Cluster Management | --master {standalone, yarn, mesos, k8s}, spark-submit, spark-submit --kill, spark-submit --status, deploy modes, application logs |
| Data Serialization | write.format(), saveAsTable(), load(), write.parquet(), write.csv(), write.json() |
| Integration with Other Tools | Hadoop, Hive, Kafka, jdbc(), spark-submit, spark-shell, HDFS, Sqoop |
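
As a worked sketch of the window-function entries above, assuming a small sales-like DataFrame (the column and key names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, lag, sum as spark_sum

spark = SparkSession.builder.appName("windows").getOrCreate()

df = spark.createDataFrame(
    [("a", "2024-01-01", 10), ("a", "2024-01-02", 15), ("b", "2024-01-01", 7)],
    ["key", "day", "value"],
)

w = Window.partitionBy("key").orderBy("day")

result = (
    df.withColumn("rn", row_number().over(w))             # rank rows per key
      .withColumn("prev", lag("value").over(w))           # previous row's value
      .withColumn("running", spark_sum("value").over(w))  # running total per key
)
result.show()
```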
Spark Performance Tuning

| Category | Technique | Code Example |
|----------|-----------|--------------|
| Resource Management | Tune executor memory for optimal performance. | `spark.conf.set("spark.executor.memory", "4g")` |
| Resource Management | Tune the number of cores per executor. | `spark.conf.set("spark.executor.cores", "4")` |
| Resource Management | Tune driver memory for better resource management. | `spark.conf.set("spark.driver.memory", "4g")` |
| Resource Management | Tune task parallelism. | `spark.conf.set("spark.default.parallelism", "100")` |
| Resource Management | Configure resource allocation based on workload requirements. | Depends on the cluster manager and workload. |
| Data Management | Use efficient data formats like Parquet or ORC. | `df.write.format("parquet").save("data.parquet")` |
| Data Management | Use efficient compression such as Snappy or Gzip. | `df.write.option("compression", "snappy").parquet("data.parquet")` |
| Data Management | Read files efficiently. | `spark.read.format("parquet").load("data.parquet")` |
| Data Management | Choose efficient data sources for faster access. | `df = spark.read.format("jdbc").options(url="jdbc:mysql://localhost/test", dbtable="table").load()` |
| Data Management | Use efficient serialization (e.g., Kryo). | `spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")` |
| Data Management | Partition data based on query requirements. | `df_partitioned = df.repartition("date")` |
| Data Management | Prefer ORC or Parquet for writes. | `df.write.format("orc").save("data.orc")` |
| Data Management | Improve data loading and storage. | `df_loaded = spark.read.option("header", "true").csv("data.csv")`<br>`df_loaded.write.option("compression", "gzip").csv("data_out.csv")` |
| Data Management | Remove duplicates to cut storage and compute costs. | `df_deduplicated = df.dropDuplicates()` |
| Shuffle and Task Management | Minimize shuffles by filtering early. | `df_filtered = df.filter(df["value"] > 10)` |
| Shuffle and Task Management | Tune the number of shuffle partitions. | `spark.conf.set("spark.sql.shuffle.partitions", "100")` |
| Shuffle and Task Management | Analyze and optimize DAG execution plans. | Inspect the DAG in the Spark UI. |
| Shuffle and Task Management | Break long-running tasks into smaller ones. | Depends on the specific job. |
| Shuffle and Task Management | Tune task scheduling to reduce resource contention. | Depends on the cluster configuration. |
| Join Optimization | Use efficient joins to reduce processing time. | `df_joined = df1.join(df2, df1.key == df2.key, "inner")` |
| Join Optimization | Tune the broadcast-join size threshold. | `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")` |
| Join Optimization | Optimize joins combined with aggregations. | `df_result = df1.join(df2, df1.key == df2.key).groupBy("key").agg({"value": "sum"})` |
| Join Optimization | Mitigate skew in joins (e.g., salting). | `df_skewed = df1.withColumn("salt", df1["key"] % 10).join(df2.withColumn("salt", df2["key"] % 10), ["key", "salt"])` |
| Performance Analysis | Monitor resource usage with the Spark UI. | Track executors, stages, and storage in the Spark UI. |
| Performance Analysis | Analyze Spark SQL execution plans. | `df.explain(True)` |
| Performance Analysis | Filter early to reduce data size. | `df_filtered = df.filter(df["value"] > 100)` |
| Caching and Persistence | Cache DataFrames that are reused across computations. | `df.cache()` |
| Caching and Persistence | Release cached data to avoid wasting memory. | `df.unpersist()` |
| Data Handling and Transformation | Use optimized aggregation strategies for large datasets. | `result_df = df.groupBy("key").agg({"value": "max"})` |
| Data Handling and Transformation | Sort large datasets efficiently. | `df_sorted = df.orderBy("key")` |
| Data Handling and Transformation | Keep transformations efficient. | `df_transformed = df.withColumn("new_col", df["old_col"] * 2)` |
| Data Handling and Transformation | Reduce serialization time. | `spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")` |
| Data Handling and Transformation | Avoid redundant computations. | `result_df = df.select("col1").distinct()` |
| Handling Skew and Bottlenecks | Mitigate data skew during processing (e.g., salting). | `df_skewed = df.withColumn("salt", df["key"] % 10)` |
| Handling Skew and Bottlenecks | Minimize data movement across the network. | Depends on network and cluster settings. |
| Configuration Tuning | Adjust Spark configuration parameters. | `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")` |
| Configuration Tuning | Cap rolling executor log size. | `spark.conf.set("spark.executor.logs.rolling.maxSize", "10MB")` |
| File Management | Read large files efficiently. | `df_large = spark.read.option("header", "true").csv("large_data.csv")` |
| File Management | Load data efficiently to reduce processing time. | `df_loaded = spark.read.json("data.json")` |
| Table Size Handling | Optimize operations for tables in GBs, TBs, and PBs. | Combine partitioning with join optimization. |
| Table Size Handling | Partition large tables for processing efficiency. | `df_partitioned = df.repartition("partition_col")` |
| Table Size Handling | Write large tables efficiently. | `df.write.mode("overwrite").parquet("large_table.parquet")` |
| Join Types and Performance | Broadcast joins for small DataFrames. | `from pyspark.sql.functions import broadcast`<br>`df_broadcasted = df1.join(broadcast(df2), "key")` |
| Join Types and Performance | Shuffle joins for DataFrames too large to broadcast. | `df_shuffled = df1.join(df2, "key")` |
| Join Types and Performance | Bucketed joins: bucket tables on the join column. | `spark.conf.set("spark.sql.sources.bucketing.enabled", "true")` |
| Join Types and Performance | Prefer sort-merge joins for large inputs. | `spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")` |
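
The salting one-liners in the table are cramped, so here is a fuller sketch of skew mitigation with a random salt, assuming df_large is skewed on key and df_small is the dimension side (all names and data are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, floor, explode, array, lit

spark = SparkSession.builder.appName("salting").getOrCreate()

# Synthetic skew: almost every row shares the same join key.
df_large = spark.createDataFrame(
    [("hot", i) for i in range(1000)] + [("cold", 1)], ["key", "value"]
)
df_small = spark.createDataFrame([("hot", "H"), ("cold", "C")], ["key", "label"])

N = 10  # number of salt buckets

# Skewed side: attach a random salt in [0, N).
salted_large = df_large.withColumn("salt", floor(rand() * N).cast("int"))

# Small side: replicate each row once per salt value.
salted_small = df_small.withColumn(
    "salt", explode(array(*[lit(i) for i in range(N)]))
)

# Join on (key, salt) so the hot key is spread over N partitions.
joined = salted_large.join(salted_small, ["key", "salt"]).drop("salt")
joined.groupBy("key").count().show()
```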
Data Structures and Algorithms

| Category | Skills |
|----------|--------|
| Built-in Data Structures | Lists, tuples, sets, dictionaries, arrays |
| Algorithms | Sorting: merge sort, quick sort, heap sort; Searching: binary search, linear search |
| Advanced Data Structures | Linked lists (singly, doubly); stacks; queues (FIFO, priority queues); trees (binary trees, AVL trees, B-trees); heaps (min-heap, max-heap); graphs (adjacency matrix, adjacency list) |
| Dynamic Programming | Solving complex problems by breaking them into simpler subproblems |
| Hashing | Hash tables; handling collisions (chaining, open addressing) |
| Tries | Efficiently handling prefixes and dynamic sets of strings |
| Disjoint Set (Union-Find) | Managing partitions and connected components in sets |
| Bit Manipulation | Efficient data manipulation using bitwise operations |
| Amortized Analysis | Analyzing algorithm performance over time, accounting for occasional expensive operations |
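
As one concrete instance from this list, a small union-find with path compression and union by rank (a standard sketch, not tied to any particular problem):

```python
class DisjointSet:
    """Union-Find with path compression and union by rank."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same component
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shorter tree under the taller one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

ds = DisjointSet(5)
ds.union(0, 1); ds.union(3, 4)
print(ds.find(1) == ds.find(0))  # True: same component
print(ds.find(2) == ds.find(3))  # False: separate components
```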
Data Handling and Processing

| Category | Skills |
|----------|--------|
| File Handling | CSV, JSON, XML, Parquet, Avro, and other file formats used in data engineering |
| Data Parsing | Parsing and extracting data from formats including JSON, XML, and custom formats |
| Data Manipulation | Using libraries like Pandas for data cleaning, transformation, aggregation, and feature engineering |
| Data Integration | Combining data from multiple sources; handling schema evolution; dealing with inconsistencies |
| Data Transformation | Transforming data with ETL processes; tools like Apache NiFi, Talend, or custom Python scripts |
| Data Validation | Implementing validation checks to ensure data quality and consistency |
| Streaming Data Processing | Handling real-time data streams with frameworks like Apache Kafka, Apache Flink, or Spark Streaming |
| Data Serialization | Serialization formats (JSON, Protobuf, Avro) and their impact on performance and interoperability |
| Data Normalization | Normalizing and denormalizing data for database design and performance optimization |
| Batch vs. Real-Time Processing | Differentiating batch from real-time processing and knowing when to use each |
| Data Lineage | Tracking the origin and transformations of data; ensuring traceability and auditability |
Design Patterns and Principles

| Category | Skills |
|----------|--------|
| Creational Patterns | Singleton, Factory Method, Abstract Factory, Builder, Prototype |
| Structural Patterns | Adapter, Decorator, Proxy, Composite, Flyweight, Bridge |
| Behavioral Patterns | Observer, Strategy, Command, Chain of Responsibility, Mediator, Memento, State, Template Method, Visitor |
| Concurrency Patterns | Producer-Consumer, Read-Write Lock, Thread Pool, Future/Promise |
| Architectural Patterns | Microservices, Serverless Architecture, Event-Driven Architecture, CQRS (Command Query Responsibility Segregation) |
| Design Principles | SOLID (Single Responsibility, Open/Closed, Liskov Substitution, Interface Segregation, Dependency Inversion); DRY (Don't Repeat Yourself); KISS (Keep It Simple, Stupid); YAGNI (You Aren't Gonna Need It) |
| Other Patterns | Facade, Iterator, Factory |
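
To make one of these concrete, a minimal Observer in Python; the class and event names are illustrative, not from any framework:

```python
class EventSource:
    """Subject: notifies registered observers when something happens."""
    def __init__(self):
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def emit(self, event):
        for callback in self._observers:
            callback(event)

def audit_log(event):
    print(f"[audit] {event}")

def alert(event):
    if event.get("level") == "error":
        print(f"[alert] {event['message']}")

source = EventSource()
source.subscribe(audit_log)
source.subscribe(alert)
source.emit({"level": "info", "message": "job started"})
source.emit({"level": "error", "message": "job failed"})
```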
APIs and Web Services

| Category | Skills |
|----------|--------|
| RESTful APIs | Creation and consumption; authentication (OAuth, JWT); rate limiting; pagination |
| SOAP APIs | Understanding WSDL; handling XML-based communication; SOAP actions |
| Web Scraping | BeautifulSoup; Scrapy; handling JavaScript-rendered content (Selenium) |
| GraphQL | Querying APIs with GraphQL; schema definition; resolvers; query optimization |
| API Testing | Tools like Postman, Swagger, or Insomnia for API testing and documentation |
| API Documentation | Creating and managing API documentation with tools like Swagger/OpenAPI |
| API Security | Best practices for securing APIs: encryption, authentication, and authorization |
| Microservices Communication | Message brokers (e.g., RabbitMQ, Kafka) for inter-service communication |
| WebSockets | Real-time communication with WebSockets for use cases like live data feeds |
| Service Orchestration | Orchestrating microservices and managing inter-service communication |
| API Gateways | API gateways (e.g., Kong, AWS API Gateway) for managing, securing, and monitoring API traffic |
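
A small sketch of consuming a paginated REST API with a bearer token using the requests library; the URL, token, and page parameters are placeholders, and real APIs vary in their pagination scheme:

```python
import requests

BASE_URL = "https://api.example.com/items"     # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # e.g., a JWT

def fetch_all(page_size=100):
    """Walk a page-numbered API until an empty page comes back."""
    page, items = 1, []
    while True:
        resp = requests.get(
            BASE_URL,
            headers=HEADERS,
            params={"page": page, "per_page": page_size},
            timeout=10,
        )
        resp.raise_for_status()  # surface 4xx/5xx instead of silently continuing
        batch = resp.json()
        if not batch:
            return items
        items.extend(batch)
        page += 1

# items = fetch_all()  # uncomment with a real endpoint and token
```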
Testing and Debugging

| Category | Skills |
|----------|--------|
| Unit Testing | unittest, pytest for writing and executing test cases; test fixtures; mocking; parameterized tests |
| Integration Testing | Testing interactions between components or systems; pytest fixtures or unittest integration tests |
| Functional Testing | Ensuring the system functions as expected from an end-user perspective; pytest and Selenium for web applications |
| End-to-End Testing | Testing the complete flow of an application; frameworks like Selenium, Cypress |
| Performance Testing | Load testing and stress testing; tools like Locust, JMeter |
| Security Testing | Identifying vulnerabilities; tools like OWASP ZAP, Bandit for Python security analysis |
| Code Coverage | Measuring the percentage of code covered by tests; coverage.py or integrated coverage tools in CI/CD pipelines |
| Debugging Techniques | Debugging tools and techniques (e.g., pdb, ipdb, IDE debuggers in PyCharm or VS Code) |
| Logging | Logging strategies with Python's logging module or third-party libraries to track behavior and diagnose issues |
| Error Reporting | Capturing and reporting errors in production (e.g., Sentry, Loggly) |
| Static Code Analysis | Analyzing code for errors, style issues, and code smells; pylint, flake8, black for formatting |
| Continuous Integration (CI) | CI pipelines for automated testing; GitHub Actions, Jenkins, Travis CI |
| Continuous Deployment (CD) | CD pipelines for automated, smooth, and consistent deployment |
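
A compact pytest example covering a fixture and parameterized tests (save as test_slug.py and run with `pytest`); the slugify function is just a stand-in for code under test:

```python
import pytest

def slugify(text: str) -> str:
    """Stand-in function under test: lowercase, dash-separated."""
    return "-".join(text.lower().split())

@pytest.fixture
def sample_titles():
    # Fixture: shared test data, set up fresh for each test that asks for it.
    return ["Hello World", "  Spaced   Out  "]

def test_slugify_with_fixture(sample_titles):
    assert slugify(sample_titles[0]) == "hello-world"
    assert slugify(sample_titles[1]) == "spaced-out"

@pytest.mark.parametrize(
    "raw,expected",
    [("One", "one"), ("Two Words", "two-words"), ("", "")],
)
def test_slugify_parametrized(raw, expected):
    assert slugify(raw) == expected
```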
Security and Compliance

| Category | Skills |
|----------|--------|
| Data Privacy | Data privacy laws (e.g., GDPR, CCPA); data anonymization and pseudonymization; ensuring legal compliance |
| Data Encryption | Encryption at rest and in transit; encryption algorithms (e.g., AES, RSA); key management with tools like AWS KMS, Azure Key Vault |
| Access Control | Role-based access control (RBAC); attribute-based access control (ABAC); managing permissions and user roles in data systems |
| Audit Logging | Tracking and recording access to data and system changes; tools and frameworks for log management |
| Security Best Practices | Securing data systems; secure coding practices; regular vulnerability assessments; security patches |
| Network Security | Network security measures: firewalls, VPNs, network segmentation |
| Incident Response | Incident response plans; detecting, responding to, and recovering from security incidents and breaches |
| Compliance Frameworks | Industry standards and frameworks (e.g., HIPAA, SOC 2, ISO 27001); ensuring systems and processes meet them |
| Data Governance | Data governance practices; data stewardship; data quality; metadata management |
| Risk Management | Assessing and managing data security and compliance risks; risk assessments; mitigation strategies |
Advanced Programming Concepts

| Category | Skills |
|----------|--------|
| Asynchronous Programming | Asynchronous I/O with asyncio; await for asynchronous tasks; managing event loops; concurrency patterns in Python |
| Concurrency | threading for I/O-bound tasks; multiprocessing for CPU-bound tasks; the Global Interpreter Lock (GIL); concurrent patterns |
| Metaprogramming | Using and creating decorators; metaclasses for dynamic class creation; customizing class behavior at runtime |
| Coroutines | Writing and managing coroutines for efficient asynchronous code; how coroutines differ from regular functions |
| Context Managers | Creating and using context managers for resource management; with statements; custom context managers |
| Dynamic Typing and Reflection | Leveraging Python's dynamic typing and reflection to write flexible, generic code |
| Memory Management | Python's memory management; garbage collection; memory optimization techniques |
| Design by Contract | Preconditions, postconditions, and invariants; ensuring code reliability and correctness |
| Functional Programming | Higher-order functions; immutability; pure functions for cleaner code |
| Meta-Programming Techniques | Custom descriptors; dynamic attribute access; class factories for advanced scenarios |
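
A minimal asyncio sketch showing coroutines running concurrently on one event loop; the delays and names are arbitrary:

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    """Coroutine: simulates I/O by sleeping without blocking the loop."""
    await asyncio.sleep(delay)
    return f"{name} done after {delay}s"

async def main():
    # gather() runs the coroutines concurrently; total time is about
    # max(delays), not their sum, because the loop interleaves the waits.
    results = await asyncio.gather(
        fetch("a", 1.0),
        fetch("b", 0.5),
        fetch("c", 0.2),
    )
    for line in results:
        print(line)

asyncio.run(main())
```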
Data Preprocessing and Cleaning Operations

| Category | Function/Method | Description | Source |
|----------|-----------------|-------------|--------|
| Handling Missing Values | dropna, fillna, isnull, notnull | Drop or fill missing values, check for nulls. | freecodecamp.org, datagy.io |
| Removing Duplicates | drop_duplicates, duplicated | Identify and remove duplicate rows. | freecodecamp.org, datagy.io |
| Data Type Conversion | astype | Convert data types. | freecodecamp.org |
| String Operations | str | Vectorized string operations for cleaning text data. | realpython.com |
| Sorting Data | sort_values, sort_index | Sort data by values or index. | pickl.ai |
| Filtering Data | query, eval | Query and filter data efficiently. | pandas.pydata.org |
| Column Manipulation | rename, drop, set_index, reset_index | Rename and drop columns, set and reset the index. | realpython.com |
| Aggregation and Grouping | groupby, agg | Group data and perform aggregate functions. | pandas.pydata.org |
| Pivoting Data | pivot_table, melt, stack, unstack | Reshape data for analysis. | stackoverflow.com, jakevdp.github.io |
| Merging and Joining | merge, join, concat | Combine data from multiple DataFrames. | realpython.com |
| Visualization | plot, hist, boxplot | Visualize data directly from DataFrames. | w3schools.com, DataCamp |
| Memory Optimization | memory_usage, astype | Optimize memory usage by changing data types. | pandas.pydata.org |
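
A short pandas sketch stringing several of these operations together on a made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": [" Alice ", "bob", "bob", None],
    "age": ["34", "17", "17", "52"],
    "city": ["NYC", "LA", "LA", np.nan],
})

clean = (
    df.drop_duplicates()           # remove exact duplicate rows
      .dropna(subset=["name"])     # drop rows missing a name
      .assign(
          name=lambda d: d["name"].str.strip().str.title(),  # vectorized str ops
          age=lambda d: d["age"].astype(int),                # type conversion
      )
      .fillna({"city": "unknown"}) # fill remaining nulls
      .sort_values("age")
)
print(clean)
```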
Pandas Concepts

| Category | Concepts |
|----------|----------|
| Basic Operations | Importing Pandas; creating DataFrames and Series; basic data structures; DataFrame arithmetic (add, subtract, multiply, divide) |
| Data Import/Export | Reading data from CSV, Excel, JSON, SQL databases; writing data to various formats (CSV, Excel, JSON) |
| Data Inspection | Viewing data: head(), tail(), info(), describe(), shape, dtypes, columns |
| Indexing and Selection | Selecting rows and columns; loc and iloc; boolean indexing; setting and resetting the index |
| Data Cleaning | Handling missing data (dropna(), fillna()); detecting and removing duplicates; renaming columns |
| Data Transformation | Applying functions (apply(), map(), applymap()); vectorized operations; type conversions; sorting (sort_values(), sort_index()) |
| Grouping and Aggregation | Grouping with groupby(); aggregations (sum(), mean(), count(), min(), max(), agg()) |
| Merging and Joining | Concatenating DataFrames (concat()); merging and joining (merge(), join()) |
| Time Series | Working with date and time data; generating date ranges; resampling; time-based indexing |
| Pivoting and Reshaping | Pivot tables (pivot_table()); reshaping (melt(), stack(), unstack()) |
| Data Visualization | Plotting with Pandas (plot(), hist(), boxplot()); integration with Matplotlib and Seaborn |
| Advanced Indexing | MultiIndex (hierarchical indexing); cross-sections (xs()); advanced indexing with pd.IndexSlice |
| Performance Optimization | Efficient memory usage; eval() and query() for efficient computations; optimizing data types |
| Iterating | Iterating over rows (iterrows(), itertuples()) and columns; applying functions across DataFrames |
| Window Functions | Rolling and expanding windows; window functions (rolling(), expanding(), ewm()) |
| Data Serialization | Reading and writing formats like Parquet, HDF5, and Feather |
| Integration with Other Libraries | NumPy for numerical operations; Pandas with Scikit-Learn for machine-learning workflows |
NumPy Concepts

| Category | Concept | Description | Source |
|----------|---------|-------------|--------|
| Array Creation | np.array(), np.zeros(), np.ones() | Creating arrays, arrays of zeros, and arrays of ones | The Data Monk, NumPy |
| Array Initialization | np.empty(), np.full(), np.arange(), np.linspace() | Uninitialized arrays, arrays filled with a value, sequences of numbers, evenly spaced numbers | Learn R, Python & Data Science Online |
| Basic Operations | Addition, subtraction, multiplication, division | Element-wise operations on arrays | datagy, ProjectPro |
| Indexing and Slicing | Indexing, slicing, boolean indexing | Accessing elements, slicing arrays, filtering with boolean conditions | datagy, Interview Query |
| Reshaping and Resizing | np.reshape(), np.resize(), np.flatten() | Changing the shape and size of arrays, flattening arrays | Interview Query, NumPy |
| Aggregation Functions | np.sum(), np.mean(), np.median(), np.std() | Sum, mean, median, and standard deviation of array elements | The Data Monk, ProjectPro |
| Broadcasting | Broadcasting rules | Applying operations to arrays of different shapes | NumPy, Real Python |
| Linear Algebra | np.dot(), np.matmul(), np.linalg.inv(), np.linalg.eig() | Dot products, matrix multiplication, matrix inversion, eigenvalues and eigenvectors | ProjectPro, NumPy |
| Advanced Indexing | Fancy indexing, index arrays | Using arrays of indices to access elements | Interview Query, NumPy |
| Data Types | dtype | Specifying and understanding data types within arrays | ProjectPro, NumPy |
| File I/O | np.save(), np.load(), np.savetxt(), np.loadtxt() | Saving and loading arrays to/from files | NumPy |
| Random Sampling | np.random.rand(), np.random.randint(), np.random.choice() | Generating random numbers and sampling from arrays | The Data Monk; Learn R, Python & Data Science Online |
| Statistical Functions | np.percentile(), np.histogram(), np.bincount() | Percentiles, histograms, and counts of unique values | Real Python, ProjectPro |
| Boolean Operations | Logical operations | Logical operations on arrays, such as np.any(), np.all() | datagy, ProjectPro |
| Sorting and Searching | np.sort(), np.argsort(), np.searchsorted() | Sorting arrays and searching for elements | The Data Monk, ProjectPro |
| Set Operations | np.unique(), np.intersect1d(), np.union1d() | Set operations on arrays | Learn R, Python & Data Science Online; NumPy |
| Data Cleaning | Handling missing values, outlier detection | Removing null values, identifying and handling outliers | The Data Monk |
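
A tiny NumPy sketch of broadcasting and boolean indexing, two of the concepts above, on synthetic data:

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Broadcasting: the 1-D row of column means stretches across both rows.
centered = data - data.mean(axis=0)
print(centered)  # [[-1.5 -1.5 -1.5] [ 1.5  1.5  1.5]]

# Boolean indexing: keep only values above the global mean (3.5).
above = data[data > data.mean()]
print(above)  # [4. 5. 6.]
```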
Timeseries Analysis

| Category | Function/Method | Description | Source |
|----------|-----------------|-------------|--------|
| Resampling | resample | Resample time series data to different frequencies. | pandas.pydata.org |
| Rolling Windows | rolling | Apply rolling window calculations. | pandas.pydata.org |
| Expanding Windows | expanding | Apply expanding window calculations. | pandas.pydata.org |
| EWM | ewm | Apply exponentially weighted functions. | pandas.pydata.org |
| Datetime Indexing | DatetimeIndex | Indexing with date ranges. | jakevdp.github.io |
| Time Period Indexing | PeriodIndex | Indexing with fixed-frequency intervals. | jakevdp.github.io |
Timestamps and Date Handling

| Category | Function/Method | Description | Source |
|----------|-----------------|-------------|--------|
| Creating Timestamps | pd.Timestamp | Create a timestamp object. | pandas.pydata.org |
| Date Conversion | pd.to_datetime | Convert strings, epochs, or a mixture to a DatetimeIndex. | pandas.pydata.org |
| Period Handling | pd.Period | Represent a time span. | pandas.pydata.org |
| Timedelta Handling | pd.Timedelta | Represent a time duration. | jakevdp.github.io |
| Date Range Generation | pd.date_range | Generate a fixed-frequency date range. | jakevdp.github.io |
| Date Offsets | pd.offsets | Create date offsets for business days, months, years, etc. | pandas.pydata.org |
| Datetime Properties | .dt | Accessor for date properties like year, month, day, weekday, etc. | pandas.pydata.org |
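
Putting a few of these together: a resample plus a rolling mean over a generated date range, with the .dt accessor at the end (synthetic data throughout):

```python
import pandas as pd
import numpy as np

# Hourly series over four days, indexed by a DatetimeIndex.
idx = pd.date_range("2024-07-01", periods=96, freq="h")
ts = pd.Series(np.random.default_rng(0).normal(size=96), index=idx)

daily = ts.resample("D").mean()          # downsample hourly -> daily means
smooth = daily.rolling(window=2).mean()  # rolling window over the daily series
print(daily, smooth, sep="\n")

# The .dt accessor works on datetime columns of a DataFrame.
df = ts.to_frame("value").reset_index().rename(columns={"index": "when"})
df["weekday"] = df["when"].dt.day_name()
print(df.head())
```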

SQL Knowledge


Database Skills

| Category | Skills |
|----------|--------|
| Basic SQL Operations | SELECT, INSERT, UPDATE, DELETE; JOINs, WHERE, GROUP BY, ORDER BY |
| Joins | INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, CROSS JOIN |
| Subqueries | Nested queries; correlated subqueries |
| Indexes | Creating, using, and optimizing indexes for query performance |
| Views | Creating, managing, and using database views for abstraction and security |
| Transactions | ACID properties; COMMIT, ROLLBACK, SAVEPOINT |
| Stored Procedures | Writing and managing stored procedures to encapsulate complex logic |
| Functions | Creating and using functions for reusable code |
| Triggers | Creating triggers for automatic actions in response to database changes |
| Constraints | Primary keys, foreign keys, unique constraints, and check constraints |
| Backup and Recovery | Backup and recovery strategies; point-in-time recovery |
| Performance Tuning | Query optimization; index tuning; execution plans; profiling and analyzing performance |
| Security Management | User roles; permissions; encryption; securing database access |
| Partitioning | Horizontal and vertical partitioning for managing large datasets |
| Replication | Replication techniques for redundancy and high availability |
| Sharding | Sharding strategies to distribute data across multiple servers |
| Data Warehousing | Designing and implementing data warehouses for large-scale storage and analysis |
| ETL (Extract, Transform, Load) | Extracting data from sources, transforming it, and loading it into a data warehouse |
| Data Lakes | Creating and managing data lakes for raw and unstructured data |
| Recursive Queries | Recursive queries for hierarchical data |
| Window Functions | Window functions for advanced analytics and calculations |
| Pivot and Unpivot | Transforming data for better analysis and reporting |
| Data Types | Numeric, string, date/time, and custom types |
| String Manipulation | Functions and techniques for manipulating and querying string data |
SQL Practice Websites

| Site Name | Link |
|-----------|------|
| DataLemur | datalemur.com |
| SQLBolt | sqlbolt.com |
| StrataScratch | stratascratch.com |
| w3resource | w3resource.com |
SQL Functions and Keywords

| Category | Functions/Keywords |
|----------|--------------------|
| Basic SQL Operations | SELECT, INSERT, UPDATE, DELETE, JOIN, WHERE, GROUP BY, ORDER BY |
| Joins | INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, CROSS JOIN |
| Subqueries | Nested queries, correlated subqueries |
| Indexes | CREATE INDEX, DROP INDEX, ALTER INDEX, INDEX |
| Views | CREATE VIEW, DROP VIEW, ALTER VIEW, VIEW |
| Transactions | BEGIN TRANSACTION, COMMIT, ROLLBACK, SAVEPOINT, SET TRANSACTION |
| Stored Procedures | CREATE PROCEDURE, ALTER PROCEDURE, DROP PROCEDURE, EXECUTE, CALL |
| Functions | CREATE FUNCTION, ALTER FUNCTION, DROP FUNCTION, RETURN, CALL |
| Triggers | CREATE TRIGGER, ALTER TRIGGER, DROP TRIGGER, BEFORE, AFTER, INSTEAD OF |
| Constraints | PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, NOT NULL, DEFAULT |
| Backup and Recovery | BACKUP DATABASE, RESTORE DATABASE, BACKUP LOG, RESTORE LOG, WITH NORECOVERY |
| Performance Tuning | EXPLAIN, ANALYZE, OPTIMIZE, INDEX, PROFILE, STATISTICS |
| Security Management | GRANT, REVOKE, DENY, USER, ROLE, PERMISSIONS |
| Partitioning | PARTITION BY, SUBPARTITION BY, ALTER PARTITION, DROP PARTITION |
| Replication | MASTER, SLAVE, REPLICA, REPLICATION, SYNC, ASYNC |
| Sharding | SHARD, SHARDING KEY, ROUTING, DISTRIBUTED |
| Data Warehousing | OLAP, OLTP, CUBE, ROLLUP, DRILLDOWN, DIMENSION, FACT |
| ETL (Extract, Transform, Load) | ETL, EXTRACT, TRANSFORM, LOAD, DATA INTEGRATION |
| Data Lakes | DATA LAKE, RAW DATA, UNSTRUCTURED DATA, SCHEMA ON READ |
| Recursive Queries | WITH RECURSIVE, UNION ALL, CTE, RECURSIVE |
| Window Functions | ROW_NUMBER, RANK, DENSE_RANK, NTILE, PARTITION BY, ORDER BY, OVER |
| Pivot and Unpivot | PIVOT, UNPIVOT, CROSS TAB, TRANSFORM |
| Data Types | INT, VARCHAR, CHAR, TEXT, DATE, TIME, FLOAT, BOOLEAN, ARRAY, JSON |
| String Manipulation | CONCAT, SUBSTRING, CHAR_LENGTH, TRIM, REPLACE, SPLIT, UPPER, LOWER |
Timestamp Functions and Format Conversion

| Database | Functions/Keywords |
|----------|--------------------|
| Oracle | CURRENT_TIMESTAMP, SYSTIMESTAMP, LOCALTIMESTAMP, SYSDATE, CURRENT_DATE, EXTRACT, TO_TIMESTAMP, TO_DATE, CAST, DBTIMEZONE, SESSIONTIMEZONE, ADD_MONTHS, MONTHS_BETWEEN, NEXT_DAY, LAST_DAY, ROUND, TRUNC |
| Oracle (format conversion) | `TO_CHAR(date_column, 'YYYY-MM-DD HH24:MI:SS')`; `TO_TIMESTAMP('2024-07-25 12:34:56', 'YYYY-MM-DD HH24:MI:SS')`; `TO_DATE('2024-07-25', 'YYYY-MM-DD')`; `FROM_TZ(TIMESTAMP '2024-07-25 12:34:56', 'UTC')` |
| PostgreSQL | CURRENT_TIMESTAMP, NOW(), LOCALTIMESTAMP, CURRENT_DATE, CURRENT_TIME, EXTRACT, DATE_TRUNC, TO_TIMESTAMP, TO_DATE, AGE, DATE_PART, INTERVAL, TIMEZONE, JUSTIFY_DAYS, JUSTIFY_HOURS |
| PostgreSQL (format conversion) | `TO_CHAR(timestamp_column, 'YYYY-MM-DD HH24:MI:SS')`; `TO_TIMESTAMP('2024-07-25 12:34:56', 'YYYY-MM-DD HH24:MI:SS')`; `TO_DATE('2024-07-25', 'YYYY-MM-DD')`; `TO_CHAR(timestamp_column, 'Month DD, YYYY')` |
| MySQL | NOW(), CURRENT_TIMESTAMP, CURDATE(), CURTIME(), SYSDATE(), UTC_TIMESTAMP(), DATE_FORMAT, STR_TO_DATE, TIMESTAMPADD, TIMESTAMPDIFF, ADDDATE, SUBDATE, DATE_ADD, DATE_SUB, EXTRACT, UNIX_TIMESTAMP, FROM_UNIXTIME |
| MySQL (format conversion) | `DATE_FORMAT(timestamp_column, '%Y-%m-%d %H:%i:%s')` formats a timestamp with the given format string; `STR_TO_DATE('2024/07/25 12:34:56', '%Y/%m/%d %H:%i:%s')` parses a string with a custom format; `DATE_FORMAT(timestamp_column, '%W, %M %e, %Y %h:%i %p')` produces output like 'Wednesday, July 25, 2024 12:34 PM' |
Data Quality Checks

| Category | Concept | Description | Example SQL Code | Sources |
|----------|---------|-------------|------------------|---------|
| Completeness | Missing Values | Ensure no critical fields are missing values; check for NULLs in critical columns. | `SELECT * FROM table WHERE critical_column IS NULL;` | Monte Carlo Data, Airbyte, ProjectPro |
| Uniqueness | Duplicate Records | Ensure records are unique; use DISTINCT or unique constraints. | `SELECT DISTINCT * FROM table;`<br>`ALTER TABLE table ADD CONSTRAINT unique_constraint UNIQUE(column);` | Monte Carlo Data, DataRobot, DvSum |
| Accuracy | Data Validity | Verify data conforms to the required format and range (e.g., email formats, numeric ranges). | `SELECT * FROM table WHERE email NOT LIKE '%@%.%';`<br>`SELECT * FROM table WHERE age < 0 OR age > 120;` | Acceldata, Databricks, IBM |
| Consistency | Referential Integrity | Ensure foreign keys match the corresponding primary keys via constraints. | `ALTER TABLE child_table ADD CONSTRAINT fk_name FOREIGN KEY (foreign_key_column) REFERENCES parent_table(primary_key_column);` | Monte Carlo Data, Datafold |
| Timeliness | Data Freshness | Ensure data is up to date via timestamp checks. | `SELECT * FROM table WHERE last_updated > NOW() - INTERVAL 1 DAY;` | Acceldata, ProjectPro |
| Validity | Schema Validation | Ensure data conforms to the predefined schema: validate types and field lengths. | `SELECT * FROM table WHERE LENGTH(column) > 255;`<br>`SELECT * FROM table WHERE column NOT LIKE 'valid_format';` | Databricks, Datafold |
| Integrity | Data Lineage | Track data flow and transformations from source to destination via metadata. | Not SQL-based; typically handled through ETL tools or metadata management systems. | Oxagile, IABAC |
| Volume | Data Volume | Ensure data volume matches expectations; check row counts before and after ETL. | `SELECT COUNT(*) FROM table;` | Monte Carlo Data, DvSum |
| Range Checks | Value Range | Ensure numeric values fall within expected ranges (e.g., ages 0-120). | `SELECT * FROM table WHERE age < 0 OR age > 120;` | KDnuggets, DataRobot |
| Pattern Matching | String Patterns | Ensure strings follow a predefined pattern (emails, phone numbers) using regular expressions. | `SELECT * FROM table WHERE phone_number NOT REGEXP '^[0-9]{10}$';` | Airbyte, ProjectPro |
| Distribution Checks | Data Distribution | Ensure data follows the expected statistical distribution. | Not SQL-based; typically done with statistical analysis tools. | Monte Carlo Data, DataRobot, KDnuggets |
| Anomaly Detection | Outlier Detection | Identify and handle outliers that deviate significantly from the norm. | `SELECT * FROM table WHERE sales < (SELECT AVG(sales) - 3 * STDDEV(sales) FROM table) OR sales > (SELECT AVG(sales) + 3 * STDDEV(sales) FROM table);` | DvSum, DataRobot |
| Schema Drift | Detecting Schema Changes | Monitor for unexpected schema changes by comparing current and historical schemas. | Not SQL-based; often done through schema-comparison tools. | Redpanda, Datafold, DataRobot |
| Volume Checks | Expected Data Volumes | Ensure volumes match historical trends (e.g., transactions per day within the expected range). | `SELECT COUNT(*) FROM transactions WHERE transaction_date = CURRENT_DATE;` | Monte Carlo Data, DvSum |
| Null Constraints | Ensuring Non-NULL Values | Ensure essential columns never contain NULLs via NOT NULL constraints. | `ALTER TABLE table MODIFY column_name datatype NOT NULL;` | Databricks, Monte Carlo Data |
| Automated Testing | CI/CD for Data Quality | Automatically test data quality in the pipeline using CI tools (e.g., CircleCI) on data-model changes. | Not SQL-based; involves CI/CD tools and scripts. | Datafold, IABAC |
| Data Profiling | Statistical Analysis | Analyze structure, content, and relationships to understand data quality. | Not SQL-based; typically done through profiling tools. | Airbyte, ProjectPro |
| Data Cleansing | Data Cleaning | Remove or correct inaccurate records (e.g., with OpenRefine). | Not SQL-based; often done through data-cleaning tools. | Airbyte, ProjectPro |
| Monitoring and Alerts | Real-Time Data Monitoring | Continuously monitor data quality and alert on anomalies (e.g., Apache Griffin). | Not SQL-based; involves monitoring tools and alert configurations. | Redpanda, DvSum |
| Domain Constraints | Domain-Specific Constraints | Ensure values fall within an allowable domain (e.g., product categories in e-commerce). | `SELECT * FROM table WHERE product_category IN ('Electronics', 'Clothing', 'Books');` | Monte Carlo Data, Datafold |
| Compliance Checks | Regulatory Compliance | Ensure data complies with regulations (e.g., GDPR, CCPA); anonymize personal data. | Not SQL-based; involves anonymization processes. | IABAC, Oxagile |
| Granularity and Relevance | Appropriate Data Granularity | Match granularity to the intended use; avoid excessive or insufficient detail. | Not SQL-based; often done through ETL and transformation processes. | IBM, Acceldata |
| Version Control | Tracking Changes to Data | Maintain versions of schemas and transformations to track changes. | Not SQL-based; involves version-control systems for schemas. | Datafold, IABAC |
| Impact Analysis | Analyzing Impact of Data Changes | Assess how data-model changes affect existing data and pipelines. | Not SQL-based; involves impact-analysis tools and processes. | DvSum, Datafold |
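
The same checks translate naturally to pandas when the data lives in DataFrames; a sketch of a few of them (completeness, uniqueness, validity, range) on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", None, "bad-email", "b@y.org"],
    "age": [34, 17, -5, 200],
})

checks = {
    # Completeness: NULLs in critical columns.
    "missing_email": df["email"].isna().sum(),
    # Uniqueness: duplicate rows.
    "duplicates": df.duplicated().sum(),
    # Accuracy: simple email-shape validity.
    "bad_emails": (~df["email"].str.contains(r"@.+\.", na=True)).sum(),
    # Range check: ages within 0-120.
    "out_of_range_age": ((df["age"] < 0) | (df["age"] > 120)).sum(),
}
for name, count in checks.items():
    print(f"{name}: {count}")
```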
String Functions

| Function | Description | Example Usage | Result |
|----------|-------------|---------------|--------|
| CONCAT() | Concatenates two or more strings. | `SELECT CONCAT('Hello', ' ', 'World');` | Hello World |
| SUBSTRING() | Extracts a substring from a string. | `SELECT SUBSTRING('Hello World', 1, 5);` | Hello |
| LENGTH() | Returns the length of a string. | `SELECT LENGTH('Hello');` | 5 |
| UPPER() | Converts a string to uppercase. | `SELECT UPPER('hello');` | HELLO |
| LOWER() | Converts a string to lowercase. | `SELECT LOWER('HELLO');` | hello |
| TRIM() | Removes leading and trailing spaces. | `SELECT TRIM(' Hello ');` | Hello |
| REPLACE() | Replaces all occurrences of a substring. | `SELECT REPLACE('Hello World', 'World', 'SQL');` | Hello SQL |
| CHARINDEX() | Returns the position of the first occurrence of a substring. | `SELECT CHARINDEX('World', 'Hello World');` | 7 |
| RIGHT() | Extracts the rightmost characters of a given length. | `SELECT RIGHT('Hello World', 5);` | World |
| LEFT() | Extracts the leftmost characters of a given length. | `SELECT LEFT('Hello World', 5);` | Hello |
| POSITION() | Returns the position of the first occurrence of a substring. | `SELECT POSITION('World' IN 'Hello World');` | 7 |
| LTRIM() | Removes leading spaces. | `SELECT LTRIM(' Hello');` | Hello |
| RTRIM() | Removes trailing spaces. | `SELECT RTRIM('Hello ');` | Hello |
| REVERSE() | Reverses the characters in a string. | `SELECT REVERSE('Hello');` | olleH |
| FORMAT() | Formats a number or date according to a format string. | `SELECT FORMAT(1234567.89, '###,###.##');` | 1,234,567.89 |
Number Functions

| Function | Description | Example Usage | Result |
|----------|-------------|---------------|--------|
| ROUND() | Rounds a number to the specified number of decimals. | `SELECT ROUND(123.4567, 2);` | 123.46 |
| FLOOR() | Rounds a number down to the nearest integer. | `SELECT FLOOR(123.4567);` | 123 |
| CEILING() | Rounds a number up to the nearest integer. | `SELECT CEILING(123.4567);` | 124 |
| ABS() | Returns the absolute value of a number. | `SELECT ABS(-123.456);` | 123.456 |
| RAND() | Returns a random float between 0 and 1. | `SELECT RAND();` | 0.1234 (varies) |
| PI() | Returns the value of π (pi). | `SELECT PI();` | 3.14159 |
| POWER() | Raises a number to the power of another number. | `SELECT POWER(2, 3);` | 8 |
| SQRT() | Returns the square root of a number. | `SELECT SQRT(16);` | 4 |
| LOG() | Returns the natural logarithm of a number. | `SELECT LOG(10);` | 2.3026 |
| EXP() | Returns e raised to the power of a number. | `SELECT EXP(1);` | 2.7183 |
| CAST() | Converts a value from one data type to another. | `SELECT CAST('123' AS INT);` | 123 |
| CONVERT() | Converts a value from one data type to another. | `SELECT CONVERT(VARCHAR, 123);` | 123 |
| SUM() | Returns the sum of a numeric column. | `SELECT SUM(salary) FROM employees;` | 50000 |
| AVG() | Returns the average of a numeric column. | `SELECT AVG(salary) FROM employees;` | 5000 |
| MIN() | Returns the minimum value of a numeric column. | `SELECT MIN(salary) FROM employees;` | 3000 |
| MAX() | Returns the maximum value of a numeric column. | `SELECT MAX(salary) FROM employees;` | 8000 |
Useful Keywords

| Keyword | Description | Example Usage | Result |
|---------|-------------|---------------|--------|
| CASE | Conditional logic in SQL queries. | `SELECT CASE WHEN age > 18 THEN 'Adult' ELSE 'Minor' END FROM users;` | Adult or Minor |
| COALESCE() | Returns the first non-null value in a list of arguments. | `SELECT COALESCE(NULL, NULL, 'Hello', 'World');` | Hello |
| NULLIF() | Returns NULL if two expressions are equal. | `SELECT NULLIF(10, 10);` | NULL |
| ISNULL() | Replaces NULL with a specified replacement value. | `SELECT ISNULL(NULL, 'Default');` | Default |
| GREATEST() | Returns the greatest value among the arguments. | `SELECT GREATEST(1, 2, 3);` | 3 |
| LEAST() | Returns the smallest value among the arguments. | `SELECT LEAST(1, 2, 3);` | 1 |
| DISTINCT | Removes duplicate values from the result set. | `SELECT DISTINCT column_name FROM table_name;` | Varies |
| LIMIT | Limits the number of rows returned. | `SELECT * FROM table_name LIMIT 10;` | First 10 rows |
| OFFSET | Skips rows before starting to return rows. | `SELECT * FROM table_name LIMIT 10 OFFSET 20;` | Rows 21-30 |
| JOIN | Combines rows from two or more tables on a related column. | `SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;` | Combined rows |
| GROUP BY | Groups rows with the same values into summary rows. | `SELECT department, COUNT(*) FROM employees GROUP BY department;` | Count per department |
| HAVING | Filters groups after GROUP BY. | `SELECT department, COUNT(*) FROM employees GROUP BY department HAVING COUNT(*) > 5;` | Groups with more than 5 employees |
| ORDER BY | Sorts the result set by one or more columns. | `SELECT * FROM table_name ORDER BY column_name ASC;` | Sorted rows |
| UNION | Combines the result sets of two or more SELECT statements. | `SELECT column_name FROM table1 UNION SELECT column_name FROM table2;` | Combined result sets |
Regex Patterns

| Pattern | Description | Example Usage | Result |
|---------|-------------|---------------|--------|
| `.` | Matches any single character except newline. | `SELECT REGEXP_MATCH('Hello', '.');` | H, e, l, l, o |
| `^` | Matches the start of a string. | `SELECT REGEXP_MATCH('Hello', '^H');` | H |
| `$` | Matches the end of a string. | `SELECT REGEXP_MATCH('Hello', 'o$');` | o |
| `*` | Matches 0 or more of the preceding element. | `SELECT REGEXP_MATCH('Hello', 'l*');` | ll |
| `+` | Matches 1 or more of the preceding element. | `SELECT REGEXP_MATCH('Hello', 'l+');` | ll |
| `?` | Matches 0 or 1 of the preceding element. | `SELECT REGEXP_MATCH('Hello', 'l?');` | l |
| `{n}` | Matches exactly n occurrences of the preceding element. | `SELECT REGEXP_MATCH('Hello', 'l{2}');` | ll |
| `{n,}` | Matches n or more occurrences of the preceding element. | `SELECT REGEXP_MATCH('Helloo', 'o{2,}');` | oo |
| `{n,m}` | Matches between n and m occurrences of the preceding element. | `SELECT REGEXP_MATCH('Hellooo', 'o{2,4}');` | ooo |
| `[]` | Matches any one of the characters inside the brackets. | `SELECT REGEXP_MATCH('Hello', '[Hh]ello');` | Hello |
| `[^ ]` | Matches any one character not inside the brackets. | `SELECT REGEXP_MATCH('Hello', '[^H]ello');` | ello |
| `\|` | Acts as an OR operator. | `SELECT REGEXP_MATCH('Hello', 'H\|h');` | H |
| `()` | Groups patterns together. | `SELECT REGEXP_MATCH('Hello', '(H\|h)ello');` | Hello |
| `\` | Escapes a special character. | `SELECT REGEXP_MATCH('Hello$', 'Hello\$');` | Hello$ |
| `\d` | Matches any digit (0-9). | `SELECT REGEXP_MATCH('123', '\d');` | 1, 2, 3 |
| `\D` | Matches any non-digit character. | `SELECT REGEXP_MATCH('abc', '\D');` | a, b, c |
| `\w` | Matches any word character (alphanumeric & underscore). | `SELECT REGEXP_MATCH('word_123', '\w');` | w, o, r, d, _, 1, 2, 3 |
| `\W` | Matches any non-word character. | `SELECT REGEXP_MATCH('word_123!', '\W');` | ! |
| `\s` | Matches any whitespace character. | `SELECT REGEXP_MATCH('hello world', '\s');` | the space |
| `\S` | Matches any non-whitespace character. | `SELECT REGEXP_MATCH('hello world', '\S');` | h, e, l, l, o, w, o, r, l, d |
More Regex Patterns and Use Cases

| Pattern | Description | Example Usage | Result |
|---------|-------------|---------------|--------|
| `^\d+$` | Matches strings that contain only digits. | `SELECT REGEXP_MATCH('12345', '^\d+$');` | 12345 |
| `\bword\b` | Matches 'word' as a whole word (word boundary). | `SELECT REGEXP_MATCH('word', '\bword\b');` | word |
| `(\d{2}-\d{2}-\d{4})` | Matches dates in 'DD-MM-YYYY' format. | `SELECT REGEXP_MATCH('12-01-2024', '(\d{2}-\d{2}-\d{4})');` | 12-01-2024 |
| `(\d+)\s(\w+)` | Matches a number, a space, then a word. | `SELECT REGEXP_MATCH('123 hello', '(\d+)\s(\w+)');` | 123 hello |
| `(?i)hello` | Case-insensitive match for 'hello'. | `SELECT REGEXP_MATCH('Hello', '(?i)hello');` | Hello |
| `^([a-z]+)\1$` | Matches strings with a repeated sequence of lowercase letters. | `SELECT REGEXP_MATCH('abab', '^([a-z]+)\1$');` | abab |
| `(?:abc\|def)\d+` | Matches 'abc' or 'def' followed by one or more digits. | `SELECT REGEXP_MATCH('def123', '(?:abc\|def)\d+');` | def123 |
| `(?<=\s)\d+` | Matches a number preceded by whitespace (positive lookbehind). | `SELECT REGEXP_MATCH('The price is 100', '(?<=\s)\d+');` | 100 |
| `(?=\d{2})\d+` | Matches a number followed by exactly two digits (positive lookahead). | `SELECT REGEXP_MATCH('12345', '(?=\d{2})\d+');` | 12345 |
| `\b(?:\d{3}-\d{2}-\d{4})\b` | Matches a social security number 'XXX-XX-XXXX' as a whole word. | `SELECT REGEXP_MATCH('123-45-6789', '\b(?:\d{3}-\d{2}-\d{4})\b');` | 123-45-6789 |
Regex for Preprocessing Challenges

| Challenge | Example Pattern | Description | SQL Example |
|-----------|-----------------|-------------|-------------|
| Extracting email addresses | `[\w\.-]+@[\w\.-]+\.\w+` | Matches standard email addresses. | `SELECT REGEXP_MATCH(text_column, '[\w\.-]+@[\w\.-]+\.\w+');` |
| Removing non-alphanumeric characters | `[^\w\s]` | Removes punctuation and special characters. | `SELECT REGEXP_REPLACE(text_column, '[^\w\s]', '');` |
| Validating phone numbers in various formats | `\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}` | Matches phone numbers with different separators. | `SELECT REGEXP_MATCH(phone_column, '\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}');` |
| Extracting dates in 'YYYY-MM-DD' format | `\d{4}-\d{2}-\d{2}` | Matches dates in 'YYYY-MM-DD' format. | `SELECT REGEXP_MATCH(date_column, '\d{4}-\d{2}-\d{2}');` |
| Splitting full names into first and last names | `(\w+)\s(\w+)` | Matches and splits full names into first and last names. | `SELECT REGEXP_MATCH(full_name_column, '(\w+)\s(\w+)');` |
| Extracting IP addresses | `\b(?:\d{1,3}\.){3}\d{1,3}\b` | Matches IPv4 addresses. | `SELECT REGEXP_MATCH(ip_column, '\b(?:\d{1,3}\.){3}\d{1,3}\b');` |
| Identifying URLs | `https?://[^\s/$.?#].[^\s]*` | Matches URLs starting with http or https. | `SELECT REGEXP_MATCH(url_column, 'https?://[^\s/$.?#].[^\s]*');` |
| Validating credit card numbers | `\b(?:\d[ -]*?){13,16}\b` | Matches credit card numbers with 13 to 16 digits. | `SELECT REGEXP_MATCH(cc_column, '\b(?:\d[ -]*?){13,16}\b');` |
| Matching hashtags | `#\w+` | Matches hashtags starting with #. | `SELECT REGEXP_MATCH(text_column, '#\w+');` |
| Extracting monetary amounts | `\$\d+(?:,\d{3})*(?:\.\d{2})?` | Matches monetary amounts with optional cents. | `SELECT REGEXP_MATCH(amount_column, '\$\d+(?:,\d{3})*(?:\.\d{2})?');` |
| Finding repeated words | `\b(\w+)\s+\1\b` | Matches repeated words. | `SELECT REGEXP_MATCH(text_column, '\b(\w+)\s+\1\b');` |
| Identifying hexadecimal color codes | `#(?:[0-9a-fA-F]{3}){1,2}` | Matches hexadecimal color codes. | `SELECT REGEXP_MATCH(color_column, '#(?:[0-9a-fA-F]{3}){1,2}');` |
| Extracting domain names from URLs | `https?://(?:www\.)?([^/\s]+)` | Extracts domain names from URLs. | `SELECT REGEXP_MATCH(url_column, 'https?://(?:www\.)?([^/\s]+)');` |
| Matching US ZIP codes | `\b\d{5}(?:-\d{4})?\b` | Matches standard and extended US ZIP codes. | `SELECT REGEXP_MATCH(zip_column, '\b\d{5}(?:-\d{4})?\b');` |
| Extracting file extensions | `\.\w+$` | Matches file extensions. | `SELECT REGEXP_MATCH(file_column, '\.\w+$');` |
| Validating URLs with specific TLDs | `https?://[^\s/$.?#].[^\s]*\.(com\|org\|net)` | Matches URLs ending in a specific top-level domain. | `SELECT REGEXP_MATCH(url_column, 'https?://[^\s/$.?#].[^\s]*\.(com\|org\|net)');` |
| Removing HTML tags | `<[^>]+>` | Removes HTML tags from text. | `SELECT REGEXP_REPLACE(html_column, '<[^>]+>', '');` |
| Matching dates in 'MM/DD/YYYY' format | `(?:0[1-9]\|1[0-2])/(?:0[1-9]\|[12][0-9]\|3[01])/\d{4}` | Matches dates in 'MM/DD/YYYY' format. | `SELECT REGEXP_MATCH(date_column, '(?:0[1-9]\|1[0-2])/(?:0[1-9]\|[12][0-9]\|3[01])/\d{4}');` |
| Finding IP addresses in text | `(?:\d{1,3}\.){3}\d{1,3}` | Matches any IP address. | `SELECT REGEXP_MATCH(text_column, '(?:\d{1,3}\.){3}\d{1,3}');` |
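
The same patterns work directly in Python's re module, which is handy for testing them before embedding them in SQL; a quick check of three patterns from the tables above on a made-up string:

```python
import re

text = "Contact a@x.com or b@y.org; invoice total $1,234.56 on 2024-07-25."

emails = re.findall(r"[\w.-]+@[\w.-]+\.\w+", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
amounts = re.findall(r"\$\d+(?:,\d{3})*(?:\.\d{2})?", text)

print(emails)   # ['a@x.com', 'b@y.org']
print(dates)    # ['2024-07-25']
print(amounts)  # ['$1,234.56']
```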
SQL Order of Execution

| Stage | Description | Example Query Component | References |
|-------|-------------|-------------------------|------------|
| 1. FROM | Determines the tables and joins used in the query. | `FROM table1 INNER JOIN table2 ON table1.id = table2.id` | MySQL Documentation: FROM |
| 2. ON | Applies the conditions for the join operation. | `ON table1.id = table2.id` | MySQL Documentation: JOIN |
| 3. JOIN | Processes the joins as specified in the FROM clause. | `INNER JOIN table2` | MySQL Documentation: JOIN |
| 4. WHERE | Filters rows based on specified conditions. | `WHERE table1.value > 100` | MySQL Documentation: WHERE |
| 5. GROUP BY | Groups rows based on specified columns. | `GROUP BY table1.category` | MySQL Documentation: GROUP BY |
| 6. HAVING | Filters groups based on specified conditions. | `HAVING COUNT(*) > 5` | MySQL Documentation: HAVING |
| 7. SELECT | Selects the columns or expressions to be returned. | `SELECT table1.id, COUNT(table2.id)` | MySQL Documentation: SELECT |
| 8. DISTINCT | Removes duplicate rows from the result set. | `SELECT DISTINCT column_name` | MySQL Documentation: DISTINCT |
| 9. ORDER BY | Sorts the result set based on specified columns. | `ORDER BY table1.value DESC` | MySQL Documentation: ORDER BY |
| 10. LIMIT | Limits the number of rows returned. | `LIMIT 10` | MySQL Documentation: LIMIT |
| 11. OFFSET | Skips a specified number of rows before returning rows (used with LIMIT). | `LIMIT 10 OFFSET 20` | MySQL Documentation: LIMIT |
| 12. UNION | Combines results from multiple SELECT statements (if used). | `SELECT column_name FROM table1 UNION SELECT column_name FROM table2` | MySQL Documentation: UNION |

Mnemonic: "Friendly Jellyfish Will Gather Happy Sea Dolphins Over Lagoons" — FROM, JOIN, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, LIMIT.
Problem Solving Tips
Condition Tip
Each keyword Put that column-related table in the LEFT JOIN
Nth highest salary Use DENSE_RANK() to avoid gaps; alternatives include using two MAX() functions, two LIMIT clauses, or DENSE_RANK() with CTE
Null based salary Use IFNULL(DISTINCT SELECT CONDITION, NULL) AS destination_column_name
Consecutive numbers 1) Use EQUI JOIN with (current_table.field = duplicate_table1.field + 1) and (current_table.field = duplicate_table2.field + 2)
2) Use RANK() and SUM() with a check that the sum equals exactly 3; Group by the interested field
Parent/Child relationship EQUI JOIN based on associated columns and filter using conditions in the JOIN itself
Finding duplicates Group by key and use HAVING > some_value
Alternatively, use ROW_NUMBER() on key and filter RN > 1
Find results from Base Table not present Use JOIN results where not in (entire other table results)
Alternatively, use CTE results and then JOIN results
Find results from other table Left anti join to get results from the first table; use LEFT JOIN where the other_table.join_column is NULL (e.g., in PySpark)
Find results from based table using condition Use JOIN results where IN (condition-based results)
For example, find max top salaried employees
Target column (compare) previous all values Use EQUI JOIN (WHERE) with DATE_DIFF() and conditions
Alternatively, use INNER JOIN (ON) with DATE_DIFF() and conditions
First logged date Use MIN(date)
Sum of all previous until current column Use SUM(target_count_column) OVER (PARTITION BY column ORDER BY column ASC/DESC) AS target_count_until_current_row
Alternatively, self-join with an inequality on the date column (e.g., t1.event_date > t2.event_date) and aggregate
FILTER CONDITION (when NULLs must appear in the result) Use WHERE column IS NULL so rows with NULL values are included in the output
ORDER BY CTE (2 derived columns from it) Order by the derived columns (e.g., ORDER BY derived_column1, derived_column2)
HAVING Use HAVING to filter on grouped or aggregated columns
IFNULL(expression, value_if_null) Use IFNULL(expression, value_if_null) to handle null values in results
expression IS NULL Use SELECT employee_id, name FROM employees WHERE bonus IS NULL;
Indexes Use indexes on columns related to JOIN, FILTER, and GROUP BY operations
Avoid full table scans Ensure filters hit indexed columns (verify with EXPLAIN) so the optimizer does not scan the whole table
Simplify queries Use CTEs to simplify complex queries
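For the Nth-highest-salary tip above, here is a minimal runnable sketch using Python's built-in sqlite3 and a hypothetical employees table; DENSE_RANK() needs SQLite 3.25+ (bundled with recent CPython builds), and the IFNULL-style NULL fallback is handled on the Python side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ann', 90000), (2, 'Bob', 120000), (3, 'Cam', 120000), (4, 'Dee', 80000);
""")

n = 2  # target rank, illustrative
row = conn.execute("""
    WITH ranked AS (
        SELECT DISTINCT salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    )
    SELECT salary FROM ranked WHERE rnk = ?
""", (n,)).fetchone()

# IFNULL-style fallback done in Python: None when no nth salary exists.
print(row[0] if row else None)  # 90000 (ties at 120000 share rank 1)
```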
DB Query Comparison
Concept MySQL Oracle SQL PostgreSQL
Basic SQL Syntax SELECT, INSERT, UPDATE, DELETE, FROM, WHERE, GROUP BY, HAVING, ORDER BY SELECT, INSERT, UPDATE, DELETE, FROM, WHERE, GROUP BY, HAVING, ORDER BY SELECT, INSERT, UPDATE, DELETE, FROM, WHERE, GROUP BY, HAVING, ORDER BY
Data Types INT, VARCHAR, DATE, TEXT, BLOB, BOOLEAN NUMBER, VARCHAR2, DATE, CLOB, BLOB, BOOLEAN INTEGER, VARCHAR, DATE, TEXT, BYTEA, BOOLEAN
Creating Tables CREATE TABLE table_name (column1 datatype, column2 datatype, ...); CREATE TABLE table_name (column1 datatype, column2 datatype, ...); CREATE TABLE table_name (column1 datatype, column2 datatype, ...);
Inserting Data INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...); INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...); INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);
Updating Data UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition; UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition; UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition;
Deleting Data DELETE FROM table_name WHERE condition; DELETE FROM table_name WHERE condition; DELETE FROM table_name WHERE condition;
Querying Data SELECT column1, column2, ... FROM table_name WHERE condition; SELECT column1, column2, ... FROM table_name WHERE condition; SELECT column1, column2, ... FROM table_name WHERE condition;
Joins INNER JOIN, LEFT JOIN, RIGHT JOIN, CROSS JOIN (no native FULL OUTER JOIN; emulate with a UNION of LEFT and RIGHT joins) INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN, CROSS JOIN INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN, CROSS JOIN, NATURAL JOIN
Subqueries SELECT * FROM (SELECT column1, column2, ... FROM table_name WHERE condition) AS sub; (derived tables require an alias) SELECT * FROM (SELECT column1, column2, ... FROM table_name WHERE condition); SELECT * FROM (SELECT column1, column2, ... FROM table_name WHERE condition) AS sub; (derived tables require an alias)
Indexes CREATE INDEX index_name ON table_name (column1, column2, ...); CREATE INDEX index_name ON table_name (column1, column2, ...); CREATE INDEX index_name ON table_name (column1, column2, ...);
Views CREATE VIEW view_name AS SELECT column1, column2, ... FROM table_name WHERE condition; CREATE VIEW view_name AS SELECT column1, column2, ... FROM table_name WHERE condition; CREATE VIEW view_name AS SELECT column1, column2, ... FROM table_name WHERE condition;
Transactions START TRANSACTION; ... COMMIT; or ROLLBACK; Transactions begin implicitly (optionally SET TRANSACTION); ... COMMIT; or ROLLBACK; BEGIN; ... COMMIT; or ROLLBACK;
Stored Procedures CREATE PROCEDURE procedure_name() BEGIN ... END; CREATE OR REPLACE PROCEDURE procedure_name AS BEGIN ... END; CREATE OR REPLACE PROCEDURE procedure_name AS $$ BEGIN ... END $$ LANGUAGE plpgsql;
Functions CREATE FUNCTION function_name (parameters) RETURNS return_datatype BEGIN ... END; CREATE OR REPLACE FUNCTION function_name (parameters) RETURN return_datatype AS BEGIN ... END; CREATE OR REPLACE FUNCTION function_name (parameters) RETURNS return_datatype AS $$ BEGIN ... END $$ LANGUAGE plpgsql;
Triggers CREATE TRIGGER trigger_name BEFORE/AFTER INSERT/UPDATE/DELETE ON table_name FOR EACH ROW ...; CREATE OR REPLACE TRIGGER trigger_name BEFORE/AFTER INSERT/UPDATE/DELETE ON table_name FOR EACH ROW ...; CREATE TRIGGER trigger_name BEFORE/AFTER INSERT/UPDATE/DELETE ON table_name FOR EACH ROW ...;
Constraints PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, DEFAULT PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, EXCLUDE
Backup and Recovery mysqldump, mysqlhotcopy, InnoDB backup RMAN, Data Pump, Flashback pg_dump, pg_basebackup, Continuous Archiving
Performance Tuning EXPLAIN, ANALYZE, INDEX, Query optimization Optimizer hints, SQL Tuning Advisor, Indexing EXPLAIN, ANALYZE, VACUUM, Indexing
Security Management GRANT, REVOKE, CREATE USER, ALTER USER, PASSWORD POLICIES DBA_ROLES, DBA_USERS, VPD (Virtual Private Database) GRANT, REVOKE, CREATE ROLE, ALTER ROLE, Row Level Security (RLS)
Partitioning RANGE, LIST, HASH partitioning RANGE, LIST, HASH, INTERVAL partitioning RANGE, LIST, HASH partitioning
Replication Master-slave replication, Group replication Data Guard, GoldenGate Streaming replication, Logical replication
Sharding Manual sharding Automatic sharding Manual sharding
Data Warehousing MySQL data warehousing solutions Oracle Exadata, Oracle Autonomous Data Warehouse PostgreSQL data warehousing solutions
Data Lakes MySQL data lakes integration Oracle Big Data Appliance PostgreSQL data lakes integration
SQL Recursive Queries
Concept Description MySQL Implementation Oracle SQL Implementation PostgreSQL Implementation Use Cases Sources
Recursive Common Table Expressions (CTEs) Recursive CTEs are used to handle hierarchical data and perform recursive queries. WITH RECURSIVE cte_name AS ( SELECT ... UNION ALL SELECT ... ) SELECT * FROM cte_name; WITH cte_name AS ( SELECT ... UNION ALL SELECT ... ) SELECT * FROM cte_name; WITH RECURSIVE cte_name AS ( SELECT ... UNION ALL SELECT ... ) SELECT * FROM cte_name; Hierarchical data, organizational charts, directory structures HackerEarth, DataCamp, KDnuggets, FreeCodeCamp
Hierarchical Queries Specialized queries to manage hierarchical data, often using CONNECT BY. Not natively supported; typically use recursive CTEs or application-level logic. SELECT ... CONNECT BY PRIOR ... Achieved using recursive CTEs similar to MySQL. Tree structures, bill of materials, family trees HackerEarth, DataCamp, KDnuggets, FreeCodeCamp
Managing Recursive Queries Techniques to manage and optimize recursive queries, such as limiting recursion depth. WITH RECURSIVE cte_name AS ( SELECT ... UNION ALL SELECT ... WHERE ... ) SELECT * FROM cte_name; SELECT ... CONNECT BY PRIOR ... AND LEVEL <= n; WITH RECURSIVE cte_name AS ( SELECT ... UNION ALL SELECT ... WHERE ... ) SELECT * FROM cte_name; Performance optimization, preventing infinite loops HackerEarth, DataCamp, KDnuggets, FreeCodeCamp
Practical Examples and Use Cases Real-world examples of recursive queries, such as finding all subordinates in an organizational hierarchy. WITH RECURSIVE EmployeeHierarchy AS ( SELECT id, name, manager_id, 1 AS level FROM employees WHERE manager_id IS NULL UNION ALL SELECT e.id, e.name, e.manager_id, h.level + 1 FROM employees e INNER JOIN EmployeeHierarchy h ON e.manager_id = h.id ) SELECT * FROM EmployeeHierarchy; SELECT id, name, manager_id FROM employees CONNECT BY PRIOR id = manager_id; WITH RECURSIVE EmployeeHierarchy AS ( SELECT id, name, manager_id, 1 AS level FROM employees WHERE manager_id IS NULL UNION ALL SELECT e.id, e.name, e.manager_id, h.level + 1 FROM employees e INNER JOIN EmployeeHierarchy h ON e.manager_id = h.id ) SELECT * FROM EmployeeHierarchy; Employee hierarchies, category hierarchies in e-commerce HackerEarth, DataCamp, KDnuggets, FreeCodeCamp
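For concreteness, the recursive-CTE example from the table runs essentially as-is on SQLite (which, like PostgreSQL and MySQL 8+, uses WITH RECURSIVE); this sketch uses Python's built-in sqlite3 with a hypothetical four-row employees table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Root', NULL), (2, 'Lead A', 1), (3, 'Lead B', 1), (4, 'IC', 2);
""")

for row in conn.execute("""
    WITH RECURSIVE EmployeeHierarchy AS (
        SELECT id, name, manager_id, 1 AS level
        FROM employees WHERE manager_id IS NULL          -- anchor member
        UNION ALL
        SELECT e.id, e.name, e.manager_id, h.level + 1
        FROM employees e
        JOIN EmployeeHierarchy h ON e.manager_id = h.id  -- recursive member
    )
    SELECT level, name FROM EmployeeHierarchy ORDER BY level
"""):
    print(row)  # (1, 'Root'), (2, 'Lead A'), (2, 'Lead B'), (3, 'IC')
```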

Cloud Knowledge

Projects
Batch_project
AWS_E2E_Batch_Pipeline_Reddit
Stream_project
AWS_E2E_Realtime_Pipeline_Yelp

Certifications

AWS Certification
AWS Certified Data Analytics
GCP Certification
GCP Professional Data Engineer

Blog posts

Streaming
AWS Kinesis + Firehose Hands-on
AWS Kinesis + Multi Lambda
Batch
Data Migration + MySQLDump
Uncompress S3 using Lambda
AWS Kinesis + Multi Lambda
GCP Cloud SQL + AWS DMS + AWS RDS
Programming
Python Essential Concepts

Leetcode Knowledge

Click item below to expand

Common Data Structures and Algorithms Patterns
Pattern Description Use Cases
Sliding Window Perform operations on a specific window size within a data structure, such as finding the longest subarray.
Usage: Use when you need to perform operations on a subset of a linear data structure (array, string) of fixed size.
Clue: Look for problems asking to find longest/shortest substring, subarray, or desired value in a linear structure.
Linear data structures (arrays, strings)
Two Pointers Use two pointers iterating through a data structure in tandem to find pairs or subarrays fulfilling conditions.
Usage: Use when you need to find pairs or subarrays that meet certain conditions in sorted arrays or linked lists.
Clue: Useful for problems involving comparisons between elements, pairs, triplets, or subarrays in sorted structures.
Sorted arrays or linked lists
Fast and Slow Pointers Use two pointers moving at different speeds to detect cycles in linked lists or arrays.
Usage: Use to detect cycles in linked lists or arrays, or find the middle of a linked list.
Clue: Look for cyclic linked list/array problems or when needing to find midpoints or check for palindromes.
Cyclic data structures
Merge Intervals Efficiently handle overlapping intervals by merging or finding overlaps.
Usage: Use for problems involving overlapping intervals that need merging or identifying.
Clue: Look for problems mentioning overlapping intervals or requiring exclusive intervals.
Intervals, scheduling
Cyclic Sort Sort an array containing numbers in a given range by swapping elements to their correct positions.
Usage: Use for sorting arrays with numbers in a given range or finding missing/duplicate numbers.
Clue: Look for problems involving sorted arrays with given ranges or finding specific numbers.
Sorted arrays, finding missing/duplicate numbers
In-place Reversal Reverse links between nodes of a linked list without using extra memory.
Usage: Use when needing to reverse links in a linked list without using extra memory.
Clue: Look for constraints requiring in-place operations on linked lists.
Linked lists
Tree BFS Traverse a tree level by level using a queue.
Usage: Use for traversing trees level by level.
Clue: Look for problems asking for level-order traversal or working with tree levels.
Tree structures
Tree DFS Traverse a tree using depth-first search, using recursion or a stack.
Usage: Use for depth-first traversal of trees, especially when needing to process nodes pre/in/post-order.
Clue: Look for tree traversal problems requiring in-order, pre-order, or post-order processing.
Tree structures
Two Heaps Use a Min Heap and Max Heap to keep track of elements and find the median efficiently.
Usage: Use for problems needing to find median elements or split data into two parts efficiently.
Clue: Look for problems mentioning finding smallest/largest/median elements.
Finding smallest/largest/median elements
Subsets Use BFS to handle permutations and combinations of a given set of elements.
Usage: Use for problems involving generating permutations and combinations of a set of elements.
Clue: Look for problems requiring combinations or permutations of elements.
Combinations, permutations
Modified Binary Search Efficiently search for an element in a sorted array, linked list, or matrix.
Usage: Use for searching elements in sorted data structures (arrays, lists, matrices).
Clue: Look for problems involving sorted data and requiring efficient searching.
Sorted data structures
Top K Elements Use a Heap to find the top/smallest/frequent 'K' elements from a set.
Usage: Use for finding the top/smallest/frequent 'K' elements in a set.
Clue: Look for problems asking for top/smallest/frequent 'K' elements.
Finding top/smallest/frequent 'K' elements
K-way Merge Use a Heap to efficiently merge 'K' sorted arrays or lists.
Usage: Use for merging 'K' sorted arrays or lists efficiently.
Clue: Look for problems involving merging sorted lists or arrays.
Merging sorted arrays/lists
Topological Sort Perform topological sorting on a directed graph with no cycles to find a linear ordering of elements.
Usage: Use for sorting elements with dependencies in a directed graph without cycles.
Clue: Look for problems dealing with graphs, dependencies, or requiring sorted order of elements with dependencies.
Graphs, dependency resolution
Sliding Window
Field Details
Description Perform operations on a specific window size within a data structure, such as finding the longest subarray.
Usage Use when you need to perform operations on a subset of a linear data structure (array, string) of fixed size.
Clue Look for problems asking to find longest/shortest substring, subarray, or desired value in a linear structure.
Use Cases Linear data structures (arrays, strings)
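A minimal Python sketch of the fixed-size variant (the function name and sample input are illustrative):

```python
def max_sum_window(nums, k):
    """Max sum over any contiguous window of size k (sliding window)."""
    window = sum(nums[:k])
    best = window
    for i in range(k, len(nums)):
        window += nums[i] - nums[i - k]  # slide: add the new element, drop the old
        best = max(best, window)
    return best

print(max_sum_window([2, 1, 5, 1, 3, 2], 3))  # 9 -> window [5, 1, 3]
```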
Two Pointers
Field Details
Description Use two pointers iterating through a data structure in tandem to find pairs or subarrays fulfilling conditions.
Usage Use when you need to find pairs or subarrays that meet certain conditions in sorted arrays or linked lists.
Clue Useful for problems involving comparisons between elements, pairs, triplets, or subarrays in sorted structures.
Use Cases Sorted arrays or linked lists
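An illustrative Python sketch for the classic pair-with-target-sum problem on a sorted array:

```python
def pair_with_sum(nums, target):
    """Return indices of two values in a sorted list that sum to target."""
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        s = nums[lo] + nums[hi]
        if s == target:
            return lo, hi
        if s < target:
            lo += 1   # need a bigger sum
        else:
            hi -= 1   # need a smaller sum
    return None

print(pair_with_sum([1, 2, 4, 6, 10], 8))  # (1, 3) -> 2 + 6
```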
Fast and Slow Pointers
Field Details
Description Use two pointers moving at different speeds to detect cycles in linked lists or arrays.
Usage Use to detect cycles in linked lists or arrays, or find the middle of a linked list.
Clue Look for cyclic linked list/array problems or when needing to find midpoints or check for palindromes.
Use Cases Cyclic data structures
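A sketch of Floyd's tortoise-and-hare cycle detection; the Node class here is a minimal stand-in:

```python
class Node:
    def __init__(self, val):
        self.val, self.next = val, None

def has_cycle(head):
    """Fast pointer moves two steps per iteration; they meet iff a cycle exists."""
    slow = fast = head
    while fast and fast.next:
        slow, fast = slow.next, fast.next.next
        if slow is fast:
            return True
    return False

a, b, c = Node(1), Node(2), Node(3)
a.next, b.next, c.next = b, c, b  # c points back to b -> cycle
print(has_cycle(a))  # True
```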
Merge Intervals
Field Details
Description Efficiently handle overlapping intervals by merging or finding overlaps.
Usage Use for problems involving overlapping intervals that need merging or identifying.
Clue Look for problems mentioning overlapping intervals or requiring exclusive intervals.
Use Cases Intervals, scheduling
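A minimal Python sketch, assuming intervals are [start, end] pairs:

```python
def merge_intervals(intervals):
    """Sort by start, then extend the last merged interval on overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlap: extend
        else:
            merged.append([start, end])
    return merged

print(merge_intervals([[1, 4], [2, 5], [7, 9]]))  # [[1, 5], [7, 9]]
```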
Cyclic Sort
Field Details
Description Sort an array containing numbers in a given range by swapping elements to their correct positions.
Usage Use for sorting arrays with numbers in a given range or finding missing/duplicate numbers.
Clue Look for problems involving sorted arrays with given ranges or finding specific numbers.
Use Cases Sorted arrays, finding missing/duplicate numbers
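A Python sketch, assuming the array holds the values 1..n with no gaps:

```python
def cyclic_sort(nums):
    """Sort values 1..n in place by swapping each value to index value - 1."""
    i = 0
    while i < len(nums):
        j = nums[i] - 1              # the correct slot for nums[i]
        if nums[i] != nums[j]:
            nums[i], nums[j] = nums[j], nums[i]
        else:
            i += 1
    return nums

print(cyclic_sort([3, 1, 5, 4, 2]))  # [1, 2, 3, 4, 5]
```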
In-place Reversal
Field Details
Description Reverse links between nodes of a linked list without using extra memory.
Usage Use when needing to reverse links in a linked list without using extra memory.
Clue Look for constraints requiring in-place operations on linked lists.
Use Cases Linked lists
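A Python sketch with a minimal stand-in Node class; only O(1) extra space is used:

```python
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.next = val, nxt

def reverse_list(head):
    """Reverse a singly linked list in place by re-pointing each node."""
    prev = None
    while head:
        head.next, prev, head = prev, head, head.next  # re-point, then advance
    return prev

node = reverse_list(Node(1, Node(2, Node(3))))
while node:
    print(node.val, end=' ')  # 3 2 1
    node = node.next
```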
Tree BFS
Field Details
Description Traverse a tree level by level using a queue.
Usage Use for traversing trees level by level.
Clue Look for problems asking for level-order traversal or working with tree levels.
Use Cases Tree structures
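A Python sketch of level-order traversal; the TreeNode class and sample tree are illustrative:

```python
from collections import deque

class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def level_order(root):
    """Collect node values level by level using a FIFO queue."""
    levels, queue = [], deque([root] if root else [])
    while queue:
        levels.append([])
        for _ in range(len(queue)):    # drain exactly one level
            node = queue.popleft()
            levels[-1].append(node.val)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
    return levels

root = TreeNode(1, TreeNode(2), TreeNode(3, TreeNode(4)))
print(level_order(root))  # [[1], [2, 3], [4]]
```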
Tree DFS
Field Details
Description Traverse a tree using depth-first search, using recursion or a stack.
Usage Use for depth-first traversal of trees, especially when needing to process nodes pre/in/post-order.
Clue Look for tree traversal problems requiring in-order, pre-order, or post-order processing.
Use Cases Tree structures
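A recursive in-order sketch in Python (pre-order and post-order just move the append):

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def inorder(node, out=None):
    """In-order DFS: left subtree, node, right subtree."""
    if out is None:
        out = []
    if node:
        inorder(node.left, out)
        out.append(node.val)
        inorder(node.right, out)
    return out

root = TreeNode(2, TreeNode(1), TreeNode(3))
print(inorder(root))  # [1, 2, 3]
```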
Two Heaps
Field Details
Description Use a Min Heap and Max Heap to keep track of elements and find the median efficiently.
Usage Use for problems needing to find median elements or split data into two parts efficiently.
Clue Look for problems mentioning finding smallest/largest/median elements.
Use Cases Finding smallest/largest/median elements
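A running-median sketch in Python using heapq; since heapq is a min-heap, the lower half stores negated values:

```python
import heapq

class MedianFinder:
    """Max-heap for the lower half, min-heap for the upper half."""
    def __init__(self):
        self.lo = []  # max-heap via negated values
        self.hi = []  # min-heap

    def add(self, num):
        heapq.heappush(self.lo, -num)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))   # rebalance
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

mf = MedianFinder()
for n in (5, 2, 8, 1):
    mf.add(n)
print(mf.median())  # 3.5 -> mean of 2 and 5
```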
Subsets
Field Details
Description Use BFS to handle permutations and combinations of a given set of elements.
Usage Use for problems involving generating permutations and combinations of a set of elements.
Clue Look for problems requiring combinations or permutations of elements.
Use Cases Combinations, permutations
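A Python sketch of the BFS-style expansion the description refers to: each pass clones every existing subset and extends it with the next element:

```python
def subsets(nums):
    """Generate all subsets by breadth-wise clone-and-extend."""
    result = [[]]
    for n in nums:
        result += [cur + [n] for cur in result]
    return result

print(subsets([1, 2, 3]))
# [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
```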
Modified Binary Search
Field Details
Description Efficiently search for an element in a sorted array, linked list, or matrix.
Usage Use for searching elements in sorted data structures (arrays, lists, matrices).
Clue Look for problems involving sorted data and requiring efficient searching.
Use Cases Sorted data structures
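One common variant as a Python sketch: binary search adapted to a rotated sorted array, deciding at each step which half is still sorted:

```python
def search_rotated(nums, target):
    """Binary search in a rotated sorted array; returns index or -1."""
    lo, hi = 0, len(nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if nums[mid] == target:
            return mid
        if nums[lo] <= nums[mid]:               # left half is sorted
            if nums[lo] <= target < nums[mid]:
                hi = mid - 1
            else:
                lo = mid + 1
        else:                                   # right half is sorted
            if nums[mid] < target <= nums[hi]:
                lo = mid + 1
            else:
                hi = mid - 1
    return -1

print(search_rotated([4, 5, 6, 7, 0, 1, 2], 0))  # 4
```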
Top K Elements
Field Details
Description Use a Heap to find the top/smallest/frequent 'K' elements from a set.
Usage Use for finding the top/smallest/frequent 'K' elements in a set.
Clue Look for problems asking for top/smallest/frequent 'K' elements.
Use Cases Finding top/smallest/frequent 'K' elements
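A Python sketch for the k-largest case: maintain a size-k min-heap so the smallest of the current top k sits at the root and is cheap to evict:

```python
import heapq

def top_k(nums, k):
    """Return the k largest values (unordered) in O(n log k)."""
    heap = nums[:k]
    heapq.heapify(heap)
    for n in nums[k:]:
        if n > heap[0]:
            heapq.heapreplace(heap, n)  # evict the current smallest
    return heap

print(sorted(top_k([3, 1, 5, 12, 2, 11], 3)))  # [5, 11, 12]
```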
K-way Merge
Field Details
Description Use a Heap to efficiently merge 'K' sorted arrays or lists.
Usage Use for merging 'K' sorted arrays or lists efficiently.
Clue Look for problems involving merging sorted lists or arrays.
Use Cases Merging sorted arrays/lists
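A Python sketch: the heap holds one (value, list index, element index) entry per list, so each pop yields the global minimum. The standard library also offers heapq.merge for exactly this job:

```python
import heapq

def k_way_merge(lists):
    """Merge k sorted lists using a heap of head elements."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(lists[i]):                      # advance within list i
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out

print(k_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))
# [1, 2, 3, 4, 5, 6, 7, 8]
```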
Topological Sort
Field Details
Description Perform topological sorting on a directed graph with no cycles to find a linear ordering of elements.
Usage Use for sorting elements with dependencies in a directed graph without cycles.
Clue Look for problems dealing with graphs, dependencies, or requiring sorted order of elements with dependencies.
Use Cases Graphs, dependency resolution
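A Python sketch of Kahn's algorithm (one standard way to implement this pattern); nodes are numbered 0..n-1 and an edge (u, v) means u must come before v:

```python
from collections import deque

def topo_sort(num_nodes, edges):
    """Kahn's algorithm: repeatedly emit nodes whose in-degree is zero."""
    graph = [[] for _ in range(num_nodes)]
    indegree = [0] * num_nodes
    for u, v in edges:
        graph[u].append(v)
        indegree[v] += 1
    queue = deque(i for i in range(num_nodes) if indegree[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order if len(order) == num_nodes else None  # None -> cycle detected

print(topo_sort(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # [0, 1, 2, 3]
```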

Please note: this content is my own work (with some AI assistance). Please do not reproduce or copy any part of this document without crediting this GitHub profile. Thank you for respecting intellectual property rights.
