# Monthly Transactions II
 
Table: Transactions

    +----------------+---------+
    | Column Name    | Type    |
    +----------------+---------+
    | id             | int     |
    | country        | varchar |
    | state          | enum    |
    | amount         | int     |
    | trans_date     | date    |
    +----------------+---------+
    id is the column of unique values of this table.
    The table has information about incoming transactions.
    The state column is an ENUM (category) of type ["approved", "declined"].
Table: Chargebacks

    +----------------+---------+
    | Column Name    | Type    |
    +----------------+---------+
    | trans_id       | int     |
    | trans_date     | date    |
    +----------------+---------+
    Chargebacks contains basic information regarding incoming chargebacks from some transactions placed in Transactions table.
    trans_id is a foreign key (reference column) to the id column of Transactions table.
    Each chargeback corresponds to a transaction made previously even if they were not approved.


    Write a solution to find for each month and country: the number of approved transactions and their total amount, the number of chargebacks, and their total amount.

    Note: In your solution, given the month and country, ignore rows with all zeros.

    Return the result table in any order.

    The result format is in the following example.

 

Example 1:

Input: 
Transactions table:

    +-----+---------+----------+--------+------------+
    | id  | country | state    | amount | trans_date |
    +-----+---------+----------+--------+------------+
    | 101 | US      | approved | 1000   | 2019-05-18 |
    | 102 | US      | declined | 2000   | 2019-05-19 |
    | 103 | US      | approved | 3000   | 2019-06-10 |
    | 104 | US      | declined | 4000   | 2019-06-13 |
    | 105 | US      | approved | 5000   | 2019-06-15 |
    +-----+---------+----------+--------+------------+
Chargebacks table:

    +----------+------------+
    | trans_id | trans_date |
    +----------+------------+
    | 102      | 2019-05-29 |
    | 101      | 2019-06-30 |
    | 105      | 2019-09-18 |
    +----------+------------+
Output: 

    +---------+---------+----------------+-----------------+------------------+-------------------+
    | month   | country | approved_count | approved_amount | chargeback_count | chargeback_amount |
    +---------+---------+----------------+-----------------+------------------+-------------------+
    | 2019-05 | US      | 1              | 1000            | 1                | 2000              |
    | 2019-06 | US      | 2              | 8000            | 1                | 1000              |
    | 2019-09 | US      | 0              | 0               | 1                | 5000              |
    +---------+---------+----------------+-----------------+------------------+-------------------+
 
Initial Ideas

    The goal of the monthly_transactions function is to process two data tables: one for transactions and another for chargebacks. The function aims to summarize the number of approved transactions and chargebacks by month and country. The output should reflect counts and amounts for both approved transactions and chargebacks, allowing for an effective comparison.

Steps

    Data Preparation: Convert the transaction dates to a consistent format representing year and month (YYYY-MM).
    Filtering Approved Transactions: Keep only the rows where transactions have an approved status.
    Aggregation of Approved Transactions: Group by month and country, counting the number of approved transactions and summing their amounts.
    Processing Chargebacks: Merge chargebacks with transactions to include relevant transaction details, and group by month and country to aggregate chargeback data.
    Combining Results: Merge the summary of approved transactions with the chargeback summary, filling any missing values with zeros.
    Formatting the Output: Ensure the output month format is correct and return the final DataFrame.    
    
Edge Cases

    No Transactions: If the transactions DataFrame is empty, no chargeback can be matched to a transaction, so the inner merge drops every chargeback and the output is empty.
    No Chargebacks: If the chargebacks DataFrame is empty, the output should reflect only approved transactions with chargeback counts and amounts set to zero.
    Same Month for Multiple Transactions: Ensure that the aggregation works correctly when multiple transactions or chargebacks occur in the same month.
    
Complexity Analysis

    Time Complexity: The function primarily involves filtering and aggregating DataFrames, resulting in a complexity of O(n log n) due to the grouping and aggregation operations, where n is the number of transactions and chargebacks.
    Space Complexity: The intermediate DataFrames (the approved rows and the merged chargebacks) take O(n) space. The final summary holds at most m * k rows, where m is the number of distinct months and k is the number of countries.
    
Follow-Up Questions and Answers
Q: What would happen if there are duplicate transactions?

    A: The function assumes that transactions are unique by id. Duplicate entries could skew the results, so it’s advisable to handle duplicates before processing.
Q: How would you modify the function to handle multiple countries?

    A: The function already supports multiple countries through its grouping operations. Additional countries would be naturally included in the aggregation.
Q: How can we extend this function to include a comparison with previous months?

    A: We could add additional logic to calculate differences between months by storing the previous month’s totals in a separate DataFrame and then performing a join or calculation.
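
The month-over-month comparison described in the last answer can be sketched with a grouped `diff()`. This is a minimal illustration, not part of the original solution: the `summary` DataFrame and the `approved_amount_change` column name are assumptions made up for the example, shaped like the output of `monthly_transactions`.

```python
import pandas as pd

# Hypothetical summary in the same shape as the function's output
summary = pd.DataFrame({
    'month': ['2019-05', '2019-06', '2019-09'],
    'country': ['US', 'US', 'US'],
    'approved_amount': [1000, 8000, 0],
})

# YYYY-MM strings sort chronologically, so sorting puts each country's
# rows in time order before taking row-to-row differences
summary = summary.sort_values(['country', 'month'])

# Difference against the previous available row of the same country
summary['approved_amount_change'] = (
    summary.groupby('country')['approved_amount'].diff()
)
print(summary)
```

Note that `diff()` compares against the previous row that exists for the country, not the previous calendar month. In this data, 2019-09 is compared with 2019-06. Comparing strict calendar months would first require reindexing onto a complete month range.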

In [None]:
import pandas as pd

def monthly_transactions(transactions: pd.DataFrame, chargebacks: pd.DataFrame) -> pd.DataFrame:
    # Work on copies so the caller's DataFrames are not modified
    transactions = transactions.copy()
    chargebacks = chargebacks.copy()

    # Step 1: Derive a YYYY-MM month key from each table's own date
    transactions['month'] = pd.to_datetime(transactions['trans_date']).dt.strftime('%Y-%m')
    chargebacks['month'] = pd.to_datetime(chargebacks['trans_date']).dt.strftime('%Y-%m')

    # Step 2: Attach the country and amount of the original transaction to each chargeback.
    # Only the needed columns are merged, which avoids a clash between the two trans_date columns.
    chargebacks = chargebacks[['trans_id', 'month']].merge(
        transactions[['id', 'country', 'amount']],
        left_on='trans_id', right_on='id', how='inner'
    )

    # Step 3: Filter approved transactions
    approved = transactions[transactions['state'] == 'approved']

    # Step 4: Group approved transactions by month and country
    result = approved.groupby(['month', 'country']).agg(
        approved_count=('id', 'count'),
        approved_amount=('amount', 'sum')
    ).reset_index()

    # Step 5: Group chargebacks by the month of the chargeback and the country
    chargeback_summary = chargebacks.groupby(['month', 'country']).agg(
        chargeback_count=('trans_id', 'count'),
        chargeback_amount=('amount', 'sum')
    ).reset_index()

    # Step 6: Combine results; missing counts and amounts become integer zeros
    combine_df = result.merge(chargeback_summary, on=['month', 'country'], how='outer')
    value_cols = ['approved_count', 'approved_amount', 'chargeback_count', 'chargeback_amount']
    combine_df[value_cols] = combine_df[value_cols].fillna(0).astype(int)

    return combine_df

# Example input data
transactions_data = {
    'id': [101, 102, 103, 104, 105],
    'country': ['US', 'US', 'US', 'US', 'US'],
    'state': ['approved', 'declined', 'approved', 'declined', 'approved'],
    'amount': [1000, 2000, 3000, 4000, 5000],
    'trans_date': ['2019-05-18', '2019-05-19', '2019-06-10', '2019-06-13', '2019-06-15']
}

chargebacks_data = {
    'trans_id': [102, 101, 105],
    'trans_date': ['2019-05-29', '2019-06-30', '2019-09-18']
}

# Create DataFrames
transactions_df = pd.DataFrame(transactions_data)
chargebacks_df = pd.DataFrame(chargebacks_data)

# Get monthly transactions
result_df = monthly_transactions(transactions_df, chargebacks_df)
print(result_df)


# Game Play Analysis III
 
Table: Activity

    +--------------+---------+
    | Column Name  | Type    |
    +--------------+---------+
    | player_id    | int     |
    | device_id    | int     |
    | event_date   | date    |
    | games_played | int     |
    +--------------+---------+
    (player_id, event_date) is the primary key (column with unique values) of this table.
    This table shows the activity of players of some games.
    Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on some day using some device.


    Write a solution to report, for each player and date, how many games the player has played so far. That is, the total number of games played by the player until that date. Check the example for clarity.

    Return the result table in any order.

    The result format is in the following example.

 

Example 1:

Input: 
Activity table:

    +-----------+-----------+------------+--------------+
    | player_id | device_id | event_date | games_played |
    +-----------+-----------+------------+--------------+
    | 1         | 2         | 2016-03-01 | 5            |
    | 1         | 2         | 2016-05-02 | 6            |
    | 1         | 3         | 2017-06-25 | 1            |
    | 3         | 1         | 2016-03-02 | 0            |
    | 3         | 4         | 2018-07-03 | 5            |
    +-----------+-----------+------------+--------------+
Output: 

    +-----------+------------+---------------------+
    | player_id | event_date | games_played_so_far |
    +-----------+------------+---------------------+
    | 1         | 2016-03-01 | 5                   |
    | 1         | 2016-05-02 | 11                  |
    | 1         | 2017-06-25 | 12                  |
    | 3         | 2016-03-02 | 0                   |
    | 3         | 2018-07-03 | 5                   |
    +-----------+------------+---------------------+
Explanation: 

    For the player with id 1, 5 + 6 = 11 games played by 2016-05-02, and 5 + 6 + 1 = 12 games played by 2017-06-25.
    For the player with id 3, 0 + 5 = 5 games played by 2018-07-03.
    Note that for each player we only care about the days when the player logged in.

Explanation of the Code

    Sorting: The first step sorts the DataFrame by player_id and event_date. This is essential because we need to calculate the cumulative sum of games played in chronological order for each player.

    Cumulative Sum Calculation: We use the groupby method to group the data by player_id and then apply cumsum() on the games_played column. This calculates the cumulative number of games played by each player up to each event date.

    Selecting Relevant Columns: After computing the cumulative sum, we create a new DataFrame called result that contains only the player_id, event_date, and the newly computed games_played_so_far.

Edge Cases

    No Activity: If the activity DataFrame is empty, the function will return an empty DataFrame.
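    Multiple Entries on the Same Day: (player_id, event_date) is the primary key, so a player has at most one row per day. If the input violated that constraint, the cumulative sum would include every row for that day, in whatever order the sort leaves them.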
    Different Players: The implementation inherently supports multiple players, ensuring that the cumulative sums are calculated independently for each player.
    
Complexity Analysis

    Time Complexity: The sorting operation dominates the complexity, making it O(n log n), where n is the number of rows in the DataFrame. The grouping and cumulative sum operation is O(n).
    Space Complexity: The space complexity is O(n) for storing the cumulative sums in the new column.

Follow-Up Questions and Answers
Q: How would the function handle players with no games played?

    A: The function will correctly output zero for those players on their event dates.
Q: Can this function be adapted for more detailed metrics?

    A: Yes, it could be expanded to include metrics like average games played per session or total session time, requiring additional data.
Q: What would happen if games_played contained negative values?

    A: The cumulative sum would still calculate, but negative values could lead to incorrect totals. Data validation would be necessary to handle such cases.
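
The data validation mentioned in the last answer could look like the sketch below. `validate_games_played` is a hypothetical helper, not part of the original solution, and raising an error is only one possible policy (dropping or clipping the rows would be alternatives).

```python
import pandas as pd

def validate_games_played(activity: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reject input whose games_played is negative."""
    invalid = activity[activity['games_played'] < 0]
    if not invalid.empty:
        raise ValueError(f"{len(invalid)} row(s) have negative games_played")
    return activity

# Made-up input with one invalid row
activity = pd.DataFrame({
    'player_id': [1, 1],
    'event_date': pd.to_datetime(['2016-03-01', '2016-05-02']),
    'games_played': [5, -6],
})

try:
    validate_games_played(activity)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Running the check before the cumulative sum keeps a single bad row from silently lowering every later total for that player.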


In [1]:
import pandas as pd

def gameplay_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Sort the DataFrame by player_id and event_date
    activity = activity.sort_values(by=['player_id', 'event_date'])
    
    # Step 2: Group by player_id and calculate the cumulative sum of games_played
    activity['games_played_so_far'] = activity.groupby('player_id')['games_played'].cumsum()
    
    # Step 3: Select the relevant columns for output
    result = activity[['player_id', 'event_date', 'games_played_so_far']]
    
    return result

# Example usage
activity_data = {
    'player_id': [1, 1, 1, 3, 3],
    'device_id': [2, 2, 3, 1, 4],
    'event_date': ['2016-03-01', '2016-05-02', '2017-06-25', '2016-03-02', '2018-07-03'],
    'games_played': [5, 6, 1, 0, 5]
}

# Create DataFrame
activity_df = pd.DataFrame(activity_data)
activity_df['event_date'] = pd.to_datetime(activity_df['event_date'])

# Get the result
result_df = gameplay_analysis(activity_df)
print(result_df)


   player_id event_date  games_played_so_far
0          1 2016-03-01                    5
1          1 2016-05-02                   11
2          1 2017-06-25                   12
3          3 2016-03-02                    0
4          3 2018-07-03                    5


# Count Student Number in Departments
 
Table: Student

    +--------------+---------+
    | Column Name  | Type    |
    +--------------+---------+
    | student_id   | int     |
    | student_name | varchar |
    | gender       | varchar |
    | dept_id      | int     |
    +--------------+---------+
    student_id is the primary key (column with unique values) for this table.
    dept_id is a foreign key (reference column) to dept_id in the Department table.
    Each row of this table indicates the name of a student, their gender, and the id of their department.


Table: Department

    +-------------+---------+
    | Column Name | Type    |
    +-------------+---------+
    | dept_id     | int     |
    | dept_name   | varchar |
    +-------------+---------+
    dept_id is the primary key (column with unique values) for this table.
    Each row of this table contains the id and the name of a department.


    Write a solution to report the respective department name and number of students majoring in each department for all departments in the Department table (even ones with no current students).

    Return the result table ordered by student_number in descending order. In case of a tie, order them by dept_name alphabetically.

    The result format is in the following example.

 

Example 1:

Input: 
Student table:

    +------------+--------------+--------+---------+
    | student_id | student_name | gender | dept_id |
    +------------+--------------+--------+---------+
    | 1          | Jack         | M      | 1       |
    | 2          | Jane         | F      | 1       |
    | 3          | Mark         | M      | 2       |
    +------------+--------------+--------+---------+
Department table:

    +---------+-------------+
    | dept_id | dept_name   |
    +---------+-------------+
    | 1       | Engineering |
    | 2       | Science     |
    | 3       | Law         |
    +---------+-------------+
Output: 

    +-------------+----------------+
    | dept_name   | student_number |
    +-------------+----------------+
    | Engineering | 2              |
    | Science     | 1              |
    | Law         | 0              |
    +-------------+----------------+
    
Initial Ideas

    The problem requires counting students in each department and ensuring that departments with no students are included in the output. This suggests a left join from the Department table to the student counts, so that every department record is retained.

Steps

    Group and Count: First, group the Student DataFrame by dept_id to count the number of students in each department.
    Merge: Perform a left join from the Department DataFrame to the student counts, so that department names are attached and departments without students are kept.
    Handle Missing Values: Replace NaN values in the student_number column with 0 to reflect departments with no students.
    Sort: Order the results by student_number in descending order, and alphabetically by dept_name for ties.
    Select Relevant Columns: Return only the dept_name and student_number columns.
    
Edge Cases

    No Students: If the Student table is empty, all departments should return with student_number as 0.
    No Departments: If the Department table is empty, the result should also be an empty DataFrame.
    All Departments Have Students: The output should still correctly count and sort as specified.
    Multiple Entries for the Same Student: student_id is the primary key, so each student appears once. The code counts rows per dept_id, so duplicate rows would inflate the counts and would have to be dropped before grouping.

Complexity

    Time Complexity: The overall complexity is O(n + m log m) where n is the number of students and m is the number of departments, mainly due to the sorting step.
    Space Complexity: The space complexity is O(m + n) because we create additional DataFrames to hold counts and results.

Follow-Up Questions and Answers

Q: How would you modify this to include more details about students?

    A: We could include additional columns from the Student DataFrame by modifying the merge and groupby operations to keep track of gender or other attributes.
Q: What if we wanted to group students by gender within each department?

    A: We would group the Student DataFrame by both dept_id and gender and then count students for each combination before merging with the Department table.
Q: How would you handle cases where student data may be invalid (e.g., null values)?

    A: We could add data cleaning steps before processing, such as dropping rows with null values in critical columns like dept_id.
Q: Can you suggest ways to optimize performance for very large datasets?

    A: Indexing the dept_id column in both DataFrames can improve join performance. Also, using more efficient data types or data storage formats (like Parquet) may help reduce memory usage and speed up operations.
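
The per-gender grouping described in the follow-up answers above can be sketched as follows. This is an illustration under the same sample data as Example 1; the variable names `counts` and `by_gender` are made up for the example.

```python
import pandas as pd

student = pd.DataFrame({
    'student_id': [1, 2, 3],
    'student_name': ['Jack', 'Jane', 'Mark'],
    'gender': ['M', 'F', 'M'],
    'dept_id': [1, 1, 2],
})
department = pd.DataFrame({
    'dept_id': [1, 2, 3],
    'dept_name': ['Engineering', 'Science', 'Law'],
})

# Count students for each (dept_id, gender) combination
counts = (
    student.groupby(['dept_id', 'gender'])
           .size()
           .reset_index(name='student_number')
)

# Left join from Department so that departments without students survive
by_gender = department.merge(counts, on='dept_id', how='left')
by_gender['student_number'] = by_gender['student_number'].fillna(0).astype(int)
print(by_gender[['dept_name', 'gender', 'student_number']])
```

A department with no students (Law here) ends up with a single row whose gender is missing and whose count is 0. Whether that row should instead be expanded into one zero row per gender is a reporting decision this sketch does not make.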

In [2]:
import pandas as pd

def count_students(student: pd.DataFrame, department: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Group by department ID in the student DataFrame and count students
    student_count = student.groupby('dept_id').size().reset_index(name='student_number')
    
    # Step 2: Merge the student count with the department DataFrame
    result = pd.merge(department, student_count, how='left', on='dept_id')
    
    # Step 3: Fill NaN values with 0 for departments without students
    result['student_number'] = result['student_number'].fillna(0).astype(int)
    
    # Step 4: Sort the results by student_number descending, then by dept_name alphabetically
    result = result.sort_values(by=['student_number', 'dept_name'], ascending=[False, True])
    
    # Step 5: Select relevant columns for output
    return result[['dept_name', 'student_number']]

# Sample Input Data
student_data = {
    'student_id': [1, 2, 3],
    'student_name': ['Jack', 'Jane', 'Mark'],
    'gender': ['M', 'F', 'M'],
    'dept_id': [1, 1, 2]
}

department_data = {
    'dept_id': [1, 2, 3],
    'dept_name': ['Engineering', 'Science', 'Law']
}

# Creating DataFrames
student_df = pd.DataFrame(student_data)
department_df = pd.DataFrame(department_data)

# Running the function
output_df = count_students(student_df, department_df)

# Displaying the output
print(output_df)


     dept_name  student_number
0  Engineering               2
1      Science               1
2          Law               0


# Shortest Distance in a Plane
 
Table: Point2D

    +-------------+------+
    | Column Name | Type |
    +-------------+------+
    | x           | int  |
    | y           | int  |
    +-------------+------+
    (x, y) is the primary key column (combination of columns with unique values) for this table.
    Each row of this table indicates the position of a point on the X-Y plane.


    The distance between two points p1(x1, y1) and p2(x2, y2) is sqrt((x2 - x1)^2 + (y2 - y1)^2).

    Write a solution to report the shortest distance between any two points from the Point2D table. Round the distance to two decimal points.

    The result format is in the following example.

 

Example 1:

Input: 
Point2D table:

    +----+----+
    | x  | y  |
    +----+----+
    | -1 | -1 |
    | 0  | 0  |
    | -1 | -2 |
    +----+----+
Output: 

    +----------+
    | shortest |
    +----------+
    | 1.00     |
    +----------+
Explanation: The shortest distance is 1.00 from point (-1, -1) to (-1, -2).

Problem Analysis

    The task is to find the shortest Euclidean distance between any two points in a plane, given a table of (x, y) coordinates. We need to handle this in a way that avoids redundant calculations, leverages efficient operations, and provides accurate, rounded results.

Initial Ideas

    Self-Join Approach: We can join the table with itself to get all possible pairs of points and calculate the distance for each pair. Then, we'll select the minimum distance.
    Avoiding Redundancy: Since distances are symmetric (distance from A to B is the same as from B to A), we only need unique pairs, not both directions.
    Vectorized Calculations: Using Pandas and NumPy for efficient, vectorized operations will help keep calculations fast.

Steps

    Create All Pairs: Use a self-join on the table to create all possible pairs of points.
    Filter Identical Pairs: Remove pairs where both points are the same, as the distance would be zero.
    Compute Distance: For each pair of points, calculate the Euclidean distance.
    Find Minimum Distance: Identify the shortest distance and round it to two decimal places.
    
Walkthrough Example

Input:

    Consider a table with the following points:

    | x  | y  |
    |----|----|
    | -1 | -1 |
    |  0 |  0 |
    | -1 | -2 |

Execution:

    Self-Join:
        Creates all pairs:
            | x_1 | y_1 | x_2 | y_2 |
            |-----|-----|-----|-----|
            | -1  | -1  | -1  | -1  |
            | -1  | -1  |  0  |  0  |
            | -1  | -1  | -1  | -2  |
            |  0  |  0  | -1  | -1  |
            |  0  |  0  |  0  |  0  |
            |  0  |  0  | -1  | -2  |
            | -1  | -2  | -1  | -1  |
            | -1  | -2  |  0  |  0  |
            | -1  | -2  | -1  | -2  |
    
    Filter Identical Pairs:
        Remove pairs where (x_1, y_1) == (x_2, y_2).

    Calculate Distance:
        Apply the distance formula on remaining pairs, e.g., for (-1, -1) to (-1, -2), the distance is sqrt(((-1) - (-1))^2 + ((-2) - (-1))^2) = 1.0.

    Find Minimum:
        Identify the minimum value among calculated distances, which is 1.0.

Output:

    +----------+
    | shortest |
    +----------+
    | 1.00     |
    +----------+
Edge Cases

    Single Point: If there’s only one point, there’s no pair to calculate, so we should handle this case by returning an empty or null result.
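    Identical Points: (x, y) is the primary key, so duplicate points should not occur. If they did, the filter would discard those pairs instead of reporting a distance of 0, and if all points were identical no valid pair would remain.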
    Multiple Closest Pairs: If there are multiple pairs with the same minimum distance, the function should still return that minimum without considering pair count.
    
Complexity Analysis

    Time Complexity:
        O(n^2), where n is the number of points, due to the need to examine all pairs.
    Space Complexity:
        O(n^2) as well, for storing all pairs of points.
        
Follow-up Questions and Answers
Q: What if we have a large dataset of points?

    A: Consider using KD-trees or spatial indexing structures, which can reduce the complexity of nearest-neighbor search.
Q: How can this be optimized further?

    A: Generate only the unique pairs instead of a full Cartesian product, for example with combinations from itertools or a half-matrix traversal.
Q: Would rounding earlier affect the accuracy?

    A: Rounding should only happen on the final answer to avoid accumulating rounding errors during intermediate steps.
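
The unique-pairs idea from the follow-up answers can be sketched with `itertools.combinations`. This is an alternative to the cross join, not the original solution; the function name `shortest_distance_pairs` is hypothetical, and returning `None` when there is no pair is an assumption about how the single-point case should be reported.

```python
from itertools import combinations
from math import hypot

import pandas as pd

def shortest_distance_pairs(points: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant that visits each unordered pair exactly once."""
    coords = list(zip(points['x'], points['y']))
    # combinations() yields n * (n - 1) / 2 pairs and never pairs a point
    # with itself, so no filtering step is needed
    shortest = min(
        (hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in combinations(coords, 2)),
        default=None,  # fewer than two points: no pair to measure
    )
    if shortest is not None:
        shortest = round(shortest, 2)  # round only the final answer
    return pd.DataFrame({'shortest': [shortest]})

points = pd.DataFrame({'x': [-1, 0, -1], 'y': [-1, 0, -2]})
result = shortest_distance_pairs(points)
print(result)
```

The generator keeps extra memory at O(n) instead of materializing all pairs, but the time is still O(n^2), and the Python-level loop may be slower than the vectorized cross join for moderate n.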

In [None]:
import numpy as np
import pandas as pd

def shortest_distance(point2_d: pd.DataFrame) -> pd.DataFrame:
    # Perform a self-join with different suffixes to get all pairs
    pairs = point2_d.merge(point2_d, how="cross", suffixes=('_1', '_2'))
    
    # Filter out pairs where the points are the same
    pairs = pairs[(pairs['x_1'] != pairs['x_2']) | (pairs['y_1'] != pairs['y_2'])]
    
    # Calculate Euclidean distance for each pair
    distance = np.sqrt((pairs['x_2'] - pairs['x_1'])**2 + (pairs['y_2'] - pairs['y_1'])**2)
    
    # Find the minimum distance and round to two decimal points
    shortest = round(distance.min(), 2)
    
    # Return result as a DataFrame with the specified format
    return pd.DataFrame({'shortest': [shortest]})

# Unpopular Books
 
Table: Books

    +----------------+---------+
    | Column Name    | Type    |
    +----------------+---------+
    | book_id        | int     |
    | name           | varchar |
    | available_from | date    |
    +----------------+---------+
    book_id is the primary key (column with unique values) of this table.
 

Table: Orders

    +----------------+---------+
    | Column Name    | Type    |
    +----------------+---------+
    | order_id       | int     |
    | book_id        | int     |
    | quantity       | int     |
    | dispatch_date  | date    |
    +----------------+---------+
    order_id is the primary key (column with unique values) of this table.
    book_id is a foreign key (reference column) to the Books table.


    Write a solution to report the books that have sold less than 10 copies in the last year, excluding books that have been available for less than one month from today. Assume today is 2019-06-23.

    Return the result table in any order.

    The result format is in the following example.

 

Example 1:

Input: 
Books table:

    +---------+--------------------+----------------+
    | book_id | name               | available_from |
    +---------+--------------------+----------------+
    | 1       | "Kalila And Demna" | 2010-01-01     |
    | 2       | "28 Letters"       | 2012-05-12     |
    | 3       | "The Hobbit"       | 2019-06-10     |
    | 4       | "13 Reasons Why"   | 2019-06-01     |
    | 5       | "The Hunger Games" | 2008-09-21     |
    +---------+--------------------+----------------+
Orders table:

    +----------+---------+----------+---------------+
    | order_id | book_id | quantity | dispatch_date |
    +----------+---------+----------+---------------+
    | 1        | 1       | 2        | 2018-07-26    |
    | 2        | 1       | 1        | 2018-11-05    |
    | 3        | 3       | 8        | 2019-06-11    |
    | 4        | 4       | 6        | 2019-06-05    |
    | 5        | 4       | 5        | 2019-06-20    |
    | 6        | 5       | 9        | 2009-02-02    |
    | 7        | 5       | 8        | 2010-04-13    |
    +----------+---------+----------+---------------+
Output: 

    +-----------+--------------------+
    | book_id   | name               |
    +-----------+--------------------+
    | 1         | "Kalila And Demna" |
    | 2         | "28 Letters"       |
    | 5         | "The Hunger Games" |
    +-----------+--------------------+
    
Step-by-Step Explanation

Step 1 - Filtering Available Books:

    The line books = books.loc[books.available_from + pd.DateOffset(30) < "2019-06-23"] filters the books DataFrame to include only those books that have been available for at least one month before the given date (2019-06-23).
    It approximates one month as 30 days and checks whether the available_from date plus 30 days is earlier than the specified date.
Step 2 - Aggregating Order Quantities:

    The line orders = orders.loc[orders.dispatch_date + pd.DateOffset(365) > "2019-06-23"] filters the orders DataFrame to include only those orders that were dispatched within the last year (365 days) from the given date.
    Then, it groups the remaining orders by book_id and sums the quantity sold for each book with groupby("book_id")["quantity"].sum(). This results in a Series where the index is book_id and the values are the total quantities sold.
Step 3 - Merging and Filtering:

    The line books.merge(orders, on="book_id", how="left") merges the filtered books DataFrame with the aggregated orders DataFrame based on the book_id.
    The how="left" parameter ensures that all books are retained, even those without any sales (these will have NaN for the quantity).
    fillna(0) replaces NaN values in the resulting DataFrame with 0, indicating that those books sold no copies.
    Finally, .query("quantity < 10") filters the merged DataFrame to keep only those books where the total quantity sold is less than 10.
    The final output only selects the book_id and name columns using [['book_id', 'name']].
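
Step 1 treats "one month" as 30 days. The sketch below contrasts that with a calendar-month offset. It is only an illustration: the three dates are made up to sit around the cutoff, and which interpretation the problem intends is an assumption the reader has to settle.

```python
import pandas as pd

today = pd.Timestamp('2019-06-23')

# Hypothetical release dates chosen to fall around the two cutoffs
available_from = pd.Series(pd.to_datetime(['2019-05-22', '2019-05-23', '2019-05-24']))

# 30-day window, equivalent to pd.DateOffset(30) in Step 1
older_than_30_days = available_from + pd.DateOffset(days=30) < today

# Calendar-month window
older_than_one_month = available_from + pd.DateOffset(months=1) < today

print(pd.DataFrame({
    'available_from': available_from,
    'older_than_30_days': older_than_30_days,
    'older_than_one_month': older_than_one_month,
}))
```

Because May has 31 days, a book released on 2019-05-23 passes the 30-day test but not the calendar-month test. For the books in Example 1 both interpretations give the same result.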

Complexity Analysis

Time Complexity:

    Step 1: The filtering of books takes O(m), where m is the number of rows in the books DataFrame.
    Step 2: The filtering of orders also takes O(n), where n is the number of rows in the orders DataFrame. The grouping and summing operation takes O(n) as well.
    Step 3: The merging operation is typically O(m+k), where k is the number of rows in the orders after aggregation. The query operation is also O(m+k) in the worst case.
    Overall Time Complexity: Therefore, the overall time complexity can be approximated as O(m+n), where m is the number of books and n is the number of orders.

Space Complexity:

    The space complexity mainly depends on the storage of the filtered DataFrames and the merged result. Thus, it can be considered O(m+k), where k is the number of unique book_ids in the orders DataFrame after grouping. This grows linearly with the input and is not constant.
    
Edge Cases to Consider

    No Sales: Books that have never been sold will still be included in the result if they meet the availability condition.
    All Sold Books: If all books have sold 10 or more copies, the function should return an empty DataFrame.
    Books Available Less Than a Month: If all books have been available for less than a month, the function should return an empty DataFrame.
    Missing Values: If there are NaN values in the available_from or dispatch_date, this could affect filtering and should be handled or considered in pre-processing steps.

In [None]:
import pandas as pd

def unpopular_books(books: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Filter books available for at least one month before the given date
    books = books.loc[books.available_from + pd.DateOffset(30) < "2019-06-23"]
    
    # Step 2: Aggregate orders to get the total quantity sold per book
    orders = orders.loc[orders.dispatch_date + pd.DateOffset(365) > "2019-06-23"]\
                   .groupby("book_id")["quantity"].sum()
    
    # Step 3: Merge books with order quantities, filling NaN with 0
    return books.merge(orders, on="book_id", how="left")\
                 .fillna(0)\
                 .query("quantity < 10")[["book_id", "name"]]


# Account Balance
 
Table: Transactions

    +-------------+------+
    | Column Name | Type |
    +-------------+------+
    | account_id  | int  |
    | day         | date |
    | type        | ENUM |
    | amount      | int  |
    +-------------+------+
    (account_id, day) is the primary key (combination of columns with unique values) for this table.
    Each row contains information about one transaction, including the transaction type, the day it occurred on, and the amount.
    type is an ENUM (category) of the type ('Deposit','Withdraw').


    Write a solution to report the balance of each user after each transaction. You may assume that the balance of each account before any transaction is 0 and that the balance will never be below 0 at any moment.

    Return the result table in ascending order by account_id, then by day in case of a tie.

    The result format is in the following example.

 

Example 1:

Input: 
Transactions table:

    +------------+------------+----------+--------+
    | account_id | day        | type     | amount |
    +------------+------------+----------+--------+
    | 1          | 2021-11-07 | Deposit  | 2000   |
    | 1          | 2021-11-09 | Withdraw | 1000   |
    | 1          | 2021-11-11 | Deposit  | 3000   |
    | 2          | 2021-12-07 | Deposit  | 7000   |
    | 2          | 2021-12-12 | Withdraw | 7000   |
    +------------+------------+----------+--------+
Output: 

    +------------+------------+---------+
    | account_id | day        | balance |
    +------------+------------+---------+
    | 1          | 2021-11-07 | 2000    |
    | 1          | 2021-11-09 | 1000    |
    | 1          | 2021-11-11 | 4000    |
    | 2          | 2021-12-07 | 7000    |
    | 2          | 2021-12-12 | 0       |
    +------------+------------+---------+
Explanation: 

    Account 1:
    - Initial balance is 0.
    - 2021-11-07 --> deposit 2000. Balance is 0 + 2000 = 2000.
    - 2021-11-09 --> withdraw 1000. Balance is 2000 - 1000 = 1000.
    - 2021-11-11 --> deposit 3000. Balance is 1000 + 3000 = 4000.
    Account 2:
    - Initial balance is 0.
    - 2021-12-07 --> deposit 7000. Balance is 0 + 7000 = 7000.
    - 2021-12-12 --> withdraw 7000. Balance is 7000 - 7000 = 0.
    
Initial Thoughts

    The problem asks for the balance of each account after each transaction, starting from an initial balance of 0. Each transaction is either a deposit or a withdrawal, and the problem guarantees that the balance never drops below 0, so no extra check is needed. The results should be ordered by account_id and then by day, both ascending.

Explanation of the Steps

Initialization:

    The function takes a DataFrame transactions as input, which contains the transaction details (account ID, day, type, and amount).
Calculating Initial Balances:

    A new column balance is created using np.where().
    If the type is "Deposit", the amount is kept as is; if it’s "Withdraw", the amount is negated (multiplied by -1).
    At this point the balance column holds the signed effect of each individual transaction, not yet a running total.
Sorting Transactions:

    The DataFrame is sorted by account_id and day to ensure that transactions are processed chronologically for each account.
Cumulative Sum:

    The cumulative balance for each account is calculated using groupby() and cumsum(), which provides the running total of the balance for each account_id.
Final Output:

    The function returns a DataFrame containing account_id, day, and the computed balance, sorted by account_id and day.
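
The steps above can be sketched as follows. This is a minimal sketch, not necessarily the original solution: the function name account_balance and the sample DataFrame (taken from Example 1) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def account_balance(transactions: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is left untouched.
    df = transactions.copy()

    # Signed amount: deposits stay positive, withdrawals become negative.
    df["balance"] = np.where(df["type"] == "Deposit", df["amount"], -df["amount"])

    # Chronological order within each account.
    df = df.sort_values(["account_id", "day"])

    # Running total of the signed amounts per account.
    df["balance"] = df.groupby("account_id")["balance"].cumsum()

    return df[["account_id", "day", "balance"]].reset_index(drop=True)


# Sample data from Example 1.
transactions = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "day": pd.to_datetime(
        ["2021-11-07", "2021-11-09", "2021-11-11", "2021-12-07", "2021-12-12"]
    ),
    "type": ["Deposit", "Withdraw", "Deposit", "Deposit", "Withdraw"],
    "amount": [2000, 1000, 3000, 7000, 7000],
})
result = account_balance(transactions)
print(result)
```

On the example input, the balance column comes out as 2000, 1000, 4000, 7000, 0, matching the expected output table.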
    
Code Explanation

    Importing Libraries: The code imports pandas for data manipulation and numpy for efficient numerical operations.

    Creating Balance Column: The np.where() function creates the initial balance column, converting withdrawals to negative values. This simplifies balance calculations in the next steps.

    Sorting: The sort_values() method arranges the transactions chronologically, which is crucial for accurate cumulative balance calculations.

    Cumulative Sum: The groupby() and cumsum() methods calculate the running total for each account, effectively simulating the balance after each transaction.

    Final DataFrame: The output is limited to the relevant columns and sorted to match the required format.

Edge Cases

    No Transactions: If the input DataFrame is empty, the output will also be an empty DataFrame.
    Only Withdrawals: An account with only withdrawals would get a negative running balance, because the function does not clamp values at 0. The problem guarantees that this never happens.
    Accounts with No Activity: Accounts that have no rows in the input do not appear in the output.
    Transactions on the Same Day: Because (account_id, day) is the primary key, an account has at most one transaction per day, so the order within each account is unambiguous after sorting.
    
Complexity Analysis

    Time Complexity: The time complexity is O(n log n), primarily due to the sorting step, where n is the number of transactions.
    Space Complexity: The space complexity is O(n) due to the storage of intermediate and final DataFrame results.

Follow-up Questions and Answers

Q: How would you modify the function if the balance could go negative?

    A: No change is needed. The function never checks or clamps the balance, so the cumulative sum would simply report negative running totals.
Q: What if the amount is negative?

    A: The current implementation assumes valid input, where amounts are non-negative for deposits and withdrawals; validation checks could be added if needed.
Q: Can you optimize this function further?

    A: The sort dominates the cost at O(n log n). If the input is already ordered by account_id and day, the sort can be skipped, leaving only the grouped cumulative sum, which is roughly linear in the number of rows.
Q: How do you handle simultaneous transactions on the same day?

    A: Under the primary key (account_id, day) this cannot happen for a single account. If duplicate days were allowed, the relative order of tied rows after sorting would not be well defined, so an extra tie-breaking column (such as a timestamp) would be needed.
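
The validation mentioned in the answers above could look roughly like this. It is a sketch under assumptions: the helper name validate_transactions and the choice to raise ValueError are hypothetical and not part of the original solution.

```python
import pandas as pd


def validate_transactions(transactions: pd.DataFrame) -> None:
    # Hypothetical helper: raise ValueError when the input breaks
    # the assumptions stated in the problem.
    if (transactions["amount"] < 0).any():
        raise ValueError("amount must be non-negative")

    # (account_id, day) is declared as the primary key.
    if transactions.duplicated(["account_id", "day"]).any():
        raise ValueError("(account_id, day) must be unique")

    # type must be one of the two ENUM values.
    if (~transactions["type"].isin(["Deposit", "Withdraw"])).any():
        raise ValueError("type must be 'Deposit' or 'Withdraw'")


valid = pd.DataFrame({
    "account_id": [1, 1],
    "day": pd.to_datetime(["2021-11-07", "2021-11-09"]),
    "type": ["Deposit", "Withdraw"],
    "amount": [2000, 1000],
})
validate_transactions(valid)  # passes silently
```

Calling the helper before computing balances keeps the main function free of defensive checks while still surfacing bad input early.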

# Queries Quality and Percentage
 
Table: Queries

    +-------------+---------+
    | Column Name | Type    |
    +-------------+---------+
    | query_name  | varchar |
    | result      | varchar |
    | position    | int     |
    | rating      | int     |
    +-------------+---------+
    This table may have duplicate rows.
    This table contains information collected from some queries on a database.
    The position column has a value from 1 to 500.
    The rating column has a value from 1 to 5. Query with rating less than 3 is a poor query.


    We define query quality as:

    The average of the ratio between query rating and its position.

    We also define poor query percentage as:

    The percentage of all queries with rating less than 3.

    Write a solution to find each query_name, the quality and poor_query_percentage.

    Both quality and poor_query_percentage should be rounded to 2 decimal places.

    Return the result table in any order.

    The result format is in the following example.



Example 1:

Input: 
Queries table:

    +------------+-------------------+----------+--------+
    | query_name | result            | position | rating |
    +------------+-------------------+----------+--------+
    | Dog        | Golden Retriever  | 1        | 5      |
    | Dog        | German Shepherd   | 2        | 5      |
    | Dog        | Mule              | 200      | 1      |
    | Cat        | Shirazi           | 5        | 2      |
    | Cat        | Siamese           | 3        | 3      |
    | Cat        | Sphynx            | 7        | 4      |
    +------------+-------------------+----------+--------+
Output: 

    +------------+---------+-----------------------+
    | query_name | quality | poor_query_percentage |
    +------------+---------+-----------------------+
    | Dog        | 2.50    | 33.33                 |
    | Cat        | 0.66    | 33.33                 |
    +------------+---------+-----------------------+
Explanation: 

    Dog queries quality is ((5 / 1) + (5 / 2) + (1 / 200)) / 3 = 2.50
    Dog queries poor_query_percentage is (1 / 3) * 100 = 33.33

    Cat queries quality equals ((2 / 5) + (3 / 3) + (4 / 7)) / 3 = 0.66
    Cat queries poor_query_percentage is (1 / 3) * 100 = 33.33
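
The arithmetic in the explanation can be checked with plain Python. This is only an illustrative sketch: the helper names quality and poor_percentage are assumptions, and the (rating, position) pairs are copied from the example table.

```python
# (rating, position) pairs from the example table
dog = [(5, 1), (5, 2), (1, 200)]
cat = [(2, 5), (3, 3), (4, 7)]


def quality(rows):
    # Average of rating / position over all rows of one query_name.
    return round(sum(r / p for r, p in rows) / len(rows), 2)


def poor_percentage(rows):
    # Share of rows with rating < 3, expressed as a percentage.
    return round(100 * sum(r < 3 for r, _ in rows) / len(rows), 2)


print(quality(dog), poor_percentage(dog))  # 2.5 33.33
print(quality(cat), poor_percentage(cat))  # 0.66 33.33
```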
    
Explanation of the Code

    Calculate Quality Ratio:

        queries['quality'] creates a column with the rating / position ratio for each query, defining the quality score for each entry.
    Calculate Poor Query Percentage:

        queries['poor_query_percentage'] is 100 where rating < 3 (a poor query) and 0 otherwise, so its mean within each group is the percentage of poor queries.
    Group by Query Name:

        The groupby('query_name')[['quality', 'poor_query_percentage']].mean() calculates the mean of quality and poor_query_percentage for each query_name. The apply(lambda x: round(x + 1e-9, 2)) ensures the values are rounded to two decimal places, addressing any floating-point imprecision.
    Return the Results:

        reset_index() converts the grouped results back to a DataFrame format with query_name as a column.

In [3]:
import pandas as pd

def queries_stats(queries: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Calculate the quality ratio for each query
    queries['quality'] = queries.rating / queries.position

    # Step 2: Calculate poor query percentage
    queries['poor_query_percentage'] = (queries.rating < 3) * 100

    # Step 3: Group by query_name and compute the mean for quality and poor query percentage
    # Then, round the results to 2 decimal places
    return queries.groupby('query_name')[['quality', 'poor_query_percentage']]\
                  .mean().apply(lambda x: round(x + 1e-9, 2)).reset_index()

data = {
    "query_name": ["Dog", "Dog", "Dog", "Cat", "Cat", "Cat"],
    "result": ["Golden Retriever", "German Shepherd", "Mule", "Shirazi", "Siamese", "Sphynx"],
    "position": [1, 2, 200, 5, 3, 7],
    "rating": [5, 5, 1, 2, 3, 4]
}
queries = pd.DataFrame(data)
print(queries_stats(queries))


  query_name  quality  poor_query_percentage
0        Cat     0.66                  33.33
1        Dog     2.50                  33.33
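
Note that the function above adds the quality and poor_query_percentage columns to the DataFrame passed in. If the caller's DataFrame should stay unchanged, one possible variant builds the columns with assign(), which returns a new DataFrame. The name queries_stats_no_mutation is an assumption for illustration; the computation is otherwise the same.

```python
import pandas as pd


def queries_stats_no_mutation(queries: pd.DataFrame) -> pd.DataFrame:
    # assign() returns a new DataFrame, so the input keeps its original columns.
    scored = queries.assign(
        quality=queries["rating"] / queries["position"],
        poor_query_percentage=(queries["rating"] < 3) * 100,
    )
    # Same aggregation and rounding as in the function above.
    return (
        scored.groupby("query_name")[["quality", "poor_query_percentage"]]
        .mean()
        .apply(lambda col: round(col + 1e-9, 2))
        .reset_index()
    )


sample = pd.DataFrame({
    "query_name": ["Dog", "Dog", "Dog", "Cat", "Cat", "Cat"],
    "result": ["Golden Retriever", "German Shepherd", "Mule",
               "Shirazi", "Siamese", "Sphynx"],
    "position": [1, 2, 200, 5, 3, 7],
    "rating": [5, 5, 1, 2, 3, 4],
})
stats = queries_stats_no_mutation(sample)
print(stats)
```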
