# Monthly Transactions II
 
Table: Transactions

    +----------------+---------+
    | Column Name    | Type    |
    +----------------+---------+
    | id             | int     |
    | country        | varchar |
    | state          | enum    |
    | amount         | int     |
    | trans_date     | date    |
    +----------------+---------+
    id is the column of unique values of this table.
    The table has information about incoming transactions.
    The state column is an ENUM (category) of type ["approved", "declined"].
Table: Chargebacks

    +----------------+---------+
    | Column Name    | Type    |
    +----------------+---------+
    | trans_id       | int     |
    | trans_date     | date    |
    +----------------+---------+
    Chargebacks contains basic information regarding incoming chargebacks from some transactions placed in Transactions table.
    trans_id is a foreign key (reference column) to the id column of Transactions table.
    Each chargeback corresponds to a transaction made previously even if they were not approved.


    Write a solution to find for each month and country: the number of approved transactions and their total amount, the number of chargebacks, and their total amount.

    Note: In your solution, given the month and country, ignore rows with all zeros.

    Return the result table in any order.

    The result format is in the following example.

 

Example 1:

Input: 
Transactions table:

    +-----+---------+----------+--------+------------+
    | id  | country | state    | amount | trans_date |
    +-----+---------+----------+--------+------------+
    | 101 | US      | approved | 1000   | 2019-05-18 |
    | 102 | US      | declined | 2000   | 2019-05-19 |
    | 103 | US      | approved | 3000   | 2019-06-10 |
    | 104 | US      | declined | 4000   | 2019-06-13 |
    | 105 | US      | approved | 5000   | 2019-06-15 |
    +-----+---------+----------+--------+------------+
Chargebacks table:

    +----------+------------+
    | trans_id | trans_date |
    +----------+------------+
    | 102      | 2019-05-29 |
    | 101      | 2019-06-30 |
    | 105      | 2019-09-18 |
    +----------+------------+
Output: 

    +---------+---------+----------------+-----------------+------------------+-------------------+
    | month   | country | approved_count | approved_amount | chargeback_count | chargeback_amount |
    +---------+---------+----------------+-----------------+------------------+-------------------+
    | 2019-05 | US      | 1              | 1000            | 1                | 2000              |
    | 2019-06 | US      | 2              | 8000            | 1                | 1000              |
    | 2019-09 | US      | 0              | 0               | 1                | 5000              |
    +---------+---------+----------------+-----------------+------------------+-------------------+
 
Initial Ideas

    The goal of the monthly_transactions function is to process two tables, one for transactions and one for chargebacks, and to summarize them by month and country. For each (month, country) pair the output reports the count and total amount of approved transactions alongside the count and total amount of chargebacks.

Steps

    Data Preparation: Derive a year-month key (YYYY-MM) from the date column of each table.
    Filtering Approved Transactions: Keep only the rows where transactions have an approved status.
    Aggregation of Approved Transactions: Group by month and country, counting the number of approved transactions and summing their amounts.
    Processing Chargebacks: Merge chargebacks with the unfiltered transactions (a declined transaction can still be charged back) to pick up each chargeback's country and amount. Group by country and by the month of the chargeback's own trans_date, not the month of the original transaction.
    Combining Results: Outer-merge the summary of approved transactions with the chargeback summary, filling any missing values with zeros.
    Formatting the Output: Ensure the month column reads YYYY-MM and return the final DataFrame.
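
The combining step can be sketched in isolation. This is a minimal illustration, assuming two hypothetical per-month summaries (`approved` and `charged`) that already have the shape produced by the grouping steps; the values mirror Example 1.

```python
import pandas as pd

# Hypothetical summaries, as they might look after the grouping steps
approved = pd.DataFrame({
    'month': ['2019-05', '2019-06'],
    'country': ['US', 'US'],
    'approved_count': [1, 2],
    'approved_amount': [1000, 8000],
})
charged = pd.DataFrame({
    'month': ['2019-05', '2019-06', '2019-09'],
    'country': ['US', 'US', 'US'],
    'chargeback_count': [1, 1, 1],
    'chargeback_amount': [2000, 1000, 5000],
})

# An outer merge keeps (month, country) pairs that appear on either side
combined = approved.merge(charged, on=['month', 'country'], how='outer')

# The side with no match comes back as NaN (float); fill with 0 and cast back
value_cols = ['approved_count', 'approved_amount',
              'chargeback_count', 'chargeback_amount']
combined = combined.fillna(0).astype({col: 'int64' for col in value_cols})
print(combined)
```

With an inner or left merge the 2019-09 row, which has chargebacks but no approved transactions, would be lost.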
    
Edge Cases

    No Transactions: If the transactions DataFrame is empty, the output is empty as well. Every chargeback must reference an existing transaction, so the inner merge leaves no chargeback rows.
    No Chargebacks: If the chargebacks DataFrame is empty, the output should reflect only approved transactions with chargeback counts and amounts set to zero.
    Same Month for Multiple Transactions: Ensure that the aggregation works correctly when multiple transactions or chargebacks occur in the same month.
    
Complexity Analysis

    Time Complexity: The work consists of one merge, two group-by aggregations, and a final outer merge. O(n log n) is a safe upper bound, where n is the total number of transaction and chargeback rows; the log factor covers any sorting of group keys.
    Space Complexity: O(n) for the intermediate DataFrames (the merged chargebacks and the filtered transactions). The result has at most one row per (month, country) pair, so at most m × k rows, where m is the number of distinct months and k is the number of countries.
    
Follow-Up Questions and Answers
Q: What would happen if there are duplicate transactions?

    A: The function assumes that transactions are unique by id. Duplicate entries could skew the results, so it’s advisable to handle duplicates before processing.
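
One way to handle duplicates up front is sketched below. The input is hypothetical, and keeping the first copy of each id is an assumption of this sketch; which copy is the right one depends on how the duplicates arose.

```python
import pandas as pd

# Hypothetical input in which transaction 101 was recorded twice
transactions = pd.DataFrame({
    'id': [101, 101, 102],
    'country': ['US', 'US', 'US'],
    'state': ['approved', 'approved', 'declined'],
    'amount': [1000, 1000, 2000],
    'trans_date': ['2019-05-18', '2019-05-18', '2019-05-19'],
})

# Flag every row whose id occurs more than once, for inspection
dupes = transactions[transactions.duplicated(subset='id', keep=False)]

# Keep the first row per id (an assumption; adjust to the data at hand)
deduped = transactions.drop_duplicates(subset='id', keep='first')
print(deduped)
```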
Q: How would you modify the function to handle multiple countries?

    A: The function already supports multiple countries through its grouping operations. Additional countries would be naturally included in the aggregation.
Q: How can we extend this function to include a comparison with previous months?

    A: We could add additional logic to calculate differences between months by storing the previous month’s totals in a separate DataFrame and then performing a join or calculation.
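
A possible sketch of such a comparison follows. Instead of a separate DataFrame and a join, it uses a grouped `diff()`, which gives the same row-to-row difference. The `summary` frame is a hypothetical stand-in for the function's output, and `approved_amount_change` is a column name made up for this example.

```python
import pandas as pd

# Hypothetical monthly summary shaped like the function's output
summary = pd.DataFrame({
    'month': ['2019-05', '2019-06', '2019-09'],
    'country': ['US', 'US', 'US'],
    'approved_amount': [1000, 8000, 0],
})

# YYYY-MM strings sort chronologically, so a plain sort is enough
summary = summary.sort_values(['country', 'month']).reset_index(drop=True)

# Difference from the previous listed month of the same country;
# the first row of each country has no predecessor and stays NaN
summary['approved_amount_change'] = (
    summary.groupby('country')['approved_amount'].diff()
)
print(summary)
```

Note that "previous" here means the previous month present in the data. Months with no activity are absent from the summary, so 2019-09 is compared with 2019-06; comparing against the previous calendar month would require filling in the missing months first.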

import pandas as pd

def monthly_transactions(transactions: pd.DataFrame, chargebacks: pd.DataFrame) -> pd.DataFrame:
    # Work on copies so the caller's DataFrames are not modified
    transactions = transactions.copy()
    chargebacks = chargebacks.copy()

    # Step 1: Derive a YYYY-MM month key from each table's own date column
    transactions['month'] = pd.to_datetime(transactions['trans_date']).dt.strftime('%Y-%m')
    chargebacks['month'] = pd.to_datetime(chargebacks['trans_date']).dt.strftime('%Y-%m')

    # Step 2: Merge chargebacks with all transactions (declined ones included)
    # to pick up country and amount; the month stays the chargeback's own month
    chargebacks = chargebacks[['trans_id', 'month']].merge(
        transactions[['id', 'country', 'amount']],
        left_on='trans_id', right_on='id', how='inner'
    )

    # Step 3: Filter approved transactions
    approved = transactions[transactions['state'] == 'approved']

    # Step 4: Group approved transactions
    result = approved.groupby(['month', 'country']).agg(
        approved_count=('id', 'count'),
        approved_amount=('amount', 'sum')
    ).reset_index()

    # Step 5: Process chargebacks
    chargebacks = chargebacks.groupby(['month', 'country']).agg(
        chargeback_count=('trans_id', 'count'),
        chargeback_amount=('amount', 'sum')
    ).reset_index()

    # Step 6: Combine results; the outer merge leaves NaN where one side has
    # no rows, so fill with zeros and restore integer dtypes
    combine_df = result.merge(chargebacks, on=['month', 'country'], how='outer').fillna(0)
    value_cols = ['approved_count', 'approved_amount', 'chargeback_count', 'chargeback_amount']
    combine_df = combine_df.astype({col: 'int64' for col in value_cols})
    
    return combine_df

# Example input data
transactions_data = {
    'id': [101, 102, 103, 104, 105],
    'country': ['US', 'US', 'US', 'US', 'US'],
    'state': ['approved', 'declined', 'approved', 'declined', 'approved'],
    'amount': [1000, 2000, 3000, 4000, 5000],
    'trans_date': ['2019-05-18', '2019-05-19', '2019-06-10', '2019-06-13', '2019-06-15']
}

chargebacks_data = {
    'trans_id': [102, 101, 105],
    'trans_date': ['2019-05-29', '2019-06-30', '2019-09-18']
}

# Create DataFrames
transactions_df = pd.DataFrame(transactions_data)
chargebacks_df = pd.DataFrame(chargebacks_data)

# Get monthly transactions
result_df = monthly_transactions(transactions_df, chargebacks_df)
print(result_df)


# Game Play Analysis III
 
Table: Activity

    +--------------+---------+
    | Column Name  | Type    |
    +--------------+---------+
    | player_id    | int     |
    | device_id    | int     |
    | event_date   | date    |
    | games_played | int     |
    +--------------+---------+
    (player_id, event_date) is the primary key (column with unique values) of this table.
    This table shows the activity of players of some games.
    Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on someday using some device.


    Write a solution to report for each player and date, how many games played so far by the player. That is, the total number of games played by the player until that date. Check the example for clarity.

    Return the result table in any order.

    The result format is in the following example.

 

Example 1:

Input: 
Activity table:

    +-----------+-----------+------------+--------------+
    | player_id | device_id | event_date | games_played |
    +-----------+-----------+------------+--------------+
    | 1         | 2         | 2016-03-01 | 5            |
    | 1         | 2         | 2016-05-02 | 6            |
    | 1         | 3         | 2017-06-25 | 1            |
    | 3         | 1         | 2016-03-02 | 0            |
    | 3         | 4         | 2018-07-03 | 5            |
    +-----------+-----------+------------+--------------+
Output: 

    +-----------+------------+---------------------+
    | player_id | event_date | games_played_so_far |
    +-----------+------------+---------------------+
    | 1         | 2016-03-01 | 5                   |
    | 1         | 2016-05-02 | 11                  |
    | 1         | 2017-06-25 | 12                  |
    | 3         | 2016-03-02 | 0                   |
    | 3         | 2018-07-03 | 5                   |
    +-----------+------------+---------------------+
Explanation: 

    For the player with id 1, 5 + 6 = 11 games played by 2016-05-02, and 5 + 6 + 1 = 12 games played by 2017-06-25.
    For the player with id 3, 0 + 5 = 5 games played by 2018-07-03.
    Note that for each player we only care about the days when the player logged in.

Explanation of the Code

    Sorting: The first step sorts the DataFrame by player_id and event_date. This is essential because we need to calculate the cumulative sum of games played in chronological order for each player.

    Cumulative Sum Calculation: We use the groupby method to group the data by player_id and then apply cumsum() on the games_played column. This calculates the cumulative number of games played by each player up to each event date.

    Selecting Relevant Columns: After computing the cumulative sum, we create a new DataFrame called result that contains only the player_id, event_date, and the newly computed games_played_so_far.

Edge Cases

    No Activity: If the activity DataFrame is empty, the function will return an empty DataFrame.
    Multiple Entries on the Same Day: The primary key (player_id, event_date) rules this out. If such rows did appear, each would get its own running total, so the same date would be listed more than once for that player.
    Different Players: The implementation inherently supports multiple players, ensuring that the cumulative sums are calculated independently for each player.
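
If upstream data did contain several rows for the same player and date, one option is to collapse them before accumulating. The sketch below assumes hypothetical input that violates the primary key; `daily` is a name introduced only for this example.

```python
import pandas as pd

# Hypothetical data: player 1 has two rows on 2016-03-01
activity = pd.DataFrame({
    'player_id': [1, 1, 1],
    'event_date': pd.to_datetime(['2016-03-01', '2016-03-01', '2016-05-02']),
    'games_played': [5, 2, 6],
})

# Collapse to one row per (player, date); groupby sorts by the keys
daily = activity.groupby(['player_id', 'event_date'], as_index=False)['games_played'].sum()

# Running total per player over the collapsed rows
daily['games_played_so_far'] = daily.groupby('player_id')['games_played'].cumsum()
print(daily)
```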
    
Complexity Analysis

    Time Complexity: The sorting operation dominates the complexity, making it O(n log n), where n is the number of rows in the DataFrame. The grouping and cumulative sum operation is O(n).
    Space Complexity: The space complexity is O(n) for storing the cumulative sums in the new column.

Follow-Up Questions and Answers
Q: How would the function handle players with no games played?

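    A: A session with games_played equal to 0 leaves the running total unchanged, so a player who has never played shows 0 on each of their event dates (as player 3 does on 2016-03-02 in the example).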
Q: Can this function be adapted for more detailed metrics?

    A: Yes, it could be expanded to include metrics like average games played per session or total session time, requiring additional data.
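
As a rough sketch of one such metric, the running average of games per session can be derived from the same grouping. It assumes that each row counts as one session; `avg_games_so_far` is a column name invented for this example. Total session time is not sketched because the Activity table has no duration column.

```python
import pandas as pd

activity = pd.DataFrame({
    'player_id': [1, 1, 1, 3, 3],
    'event_date': pd.to_datetime(['2016-03-01', '2016-05-02', '2017-06-25',
                                  '2016-03-02', '2018-07-03']),
    'games_played': [5, 6, 1, 0, 5],
}).sort_values(['player_id', 'event_date'])

grouped = activity.groupby('player_id')['games_played']

# Running total and number of sessions seen so far, per player
running_total = grouped.cumsum()
sessions_so_far = grouped.cumcount() + 1

# Assumption: one row equals one session
activity['avg_games_so_far'] = running_total / sessions_so_far
print(activity[['player_id', 'event_date', 'avg_games_so_far']])
```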
Q: What would happen if games_played contained negative values?

    A: The cumulative sum would still be computed, but the running total would decrease at those rows, which makes no sense for a count of games. Validating the data beforehand would be necessary to handle such cases.
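
A minimal validation sketch is shown below. `validate_games_played` is a hypothetical helper, not part of the solution, and raising an error is only one possible policy; dropping or clipping the offending rows would be alternatives.

```python
import pandas as pd

def validate_games_played(activity: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reject input with negative games_played."""
    bad = activity[activity['games_played'] < 0]
    if not bad.empty:
        raise ValueError(f"{len(bad)} row(s) have negative games_played")
    return activity

# Hypothetical input with one invalid row
activity = pd.DataFrame({
    'player_id': [1, 1],
    'event_date': ['2016-03-01', '2016-05-02'],
    'games_played': [5, -6],
})

try:
    validate_games_played(activity)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```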


import pandas as pd

def gameplay_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Sort the DataFrame by player_id and event_date
    activity = activity.sort_values(by=['player_id', 'event_date'])
    
    # Step 2: Group by player_id and calculate the cumulative sum of games_played
    activity['games_played_so_far'] = activity.groupby('player_id')['games_played'].cumsum()
    
    # Step 3: Select the relevant columns for output
    result = activity[['player_id', 'event_date', 'games_played_so_far']]
    
    return result

# Example usage
activity_data = {
    'player_id': [1, 1, 1, 3, 3],
    'device_id': [2, 2, 3, 1, 4],
    'event_date': ['2016-03-01', '2016-05-02', '2017-06-25', '2016-03-02', '2018-07-03'],
    'games_played': [5, 6, 1, 0, 5]
}

# Create DataFrame
activity_df = pd.DataFrame(activity_data)
activity_df['event_date'] = pd.to_datetime(activity_df['event_date'])

# Get the result
result_df = gameplay_analysis(activity_df)
print(result_df)


   player_id event_date  games_played_so_far
0          1 2016-03-01                    5
1          1 2016-05-02                   11
2          1 2017-06-25                   12
3          3 2016-03-02                    0
4          3 2018-07-03                    5
