# Exercises

<a name="anchorMario" style="position:absolute;"></a>
<hr style="border:2px solid">

# Mario Kart Challenge
<hr style="border-top:1px dashed">

In the previous notebook, you simulated the red and blue shells in Mario Kart. Your task now is to improve that code using **functions**, **error handling**, and anything else you have learned since.

CHALLENGES:

- Modify the code so that after a shell is launched, the game doesn't immediately end.
  - The user should be able to launch multiple shells in a loop.
  - Add an option to quit the game.

- Show the player positions more clearly.
  - E.g. print player rankings.

In [None]:
# Develop your code here (use as many cells as you wish)


Choose a shell to launch (red/blue): red
Enter the position of the player launching the shell (1 = Mario, 2 = Luigi, 3 = Peach, 4 = Yoshi, 5 = Bowser, 6 = Toad): 3
The Red Shell hits Luigi!
Updated Player Positions: ['Mario', 'Peach', 'Luigi', 'Yoshi', 'Bowser', 'Toad']


<details>
    <summary style="color:green;font-weight:bold">Click here for a starting point </summary>
    
>

```
# List of players in the race
players = ["Mario", "Luigi", "Peach", "Yoshi", "Bowser", "Toad"]

# Ask the user which shell to launch
shell_choice = input("Choose a shell to launch (red/blue): ").lower()

# Ask the user which player will launch the shell
player_position = int(input(f"Enter the position of the player launching the shell (1 = {players[0]}, 2 = {players[1]}, 3 = {players[2]}, 4 = {players[3]}, 5 = {players[4]}, 6 = {players[5]}): ")) - 1

# Validate the user input for the player's position
if player_position < 0 or player_position >= len(players):
    print("Invalid player position.")
else:
    if shell_choice == "red":
        # Red Shell: Targets the player ahead
        if player_position > 0:
            target_red_shell = player_position - 1
            print(f"The Red Shell hits {players[target_red_shell]}!")
            # Move the target player back one position (index increases)
            players[target_red_shell], players[player_position] = players[player_position], players[target_red_shell]
        else:
            print("No player ahead to target!")

    elif shell_choice == "blue":
        # Blue Shell: Always targets the player in 1st place
        print(f"The Blue Shell hits {players[0]}!")
        # Move the player in 1st place to 4th place
        players.insert(3, players.pop(0))

    else:
        print("Invalid shell choice.")

    # Print the updated player positions after the shell is launched
    print("Updated Player Positions:", players)

```
</details>

<details>
    <summary style="color:green;font-weight:bold">Click here for example code including functions and error-handling</summary>
    
>

```
# List of players in the race
players = ["Mario", "Luigi", "Peach", "Yoshi", "Bowser", "Toad"]

def get_shell_choice():
    """Prompt the user to choose a shell and validate input."""
    while True:
        shell = input("Choose a shell to launch (red/blue): ").lower()
        if shell in ["red", "blue"]:
            return shell
        else:
            print("Invalid choice. Please enter 'red' or 'blue'.")

def get_player_position():
    """Prompt the user to choose a player position and validate input."""
    while True:
        try:
            player_input = int(input(
                f"Enter the position of the player launching the shell "
                f"(1 = {players[0]}, 2 = {players[1]}, 3 = {players[2]}, "
                f"4 = {players[3]}, 5 = {players[4]}, 6 = {players[5]}): ")) - 1
            
            if 0 <= player_input < len(players):
                return player_input
            else:
                print(f"Invalid position. Choose a number between 1 and {len(players)}.")
        except ValueError:
            print("Invalid input. Please enter a number.")

def launch_red_shell(player_position):
    """Handle the logic for launching a red shell."""
    if player_position > 0:
        target_red_shell = player_position - 1
        print(f"The Red Shell hits {players[target_red_shell]}!")
        # Move the target player back one position
        players[target_red_shell], players[player_position] = players[player_position], players[target_red_shell]
    else:
        print("No player ahead to target!")

def launch_blue_shell():
    """Handle the logic for launching a blue shell."""
    print(f"The Blue Shell hits {players[0]}!")
    # Move the player in 1st place to 4th place (index 3)
    players.insert(3, players.pop(0))


# Main game flow (a single turn)
shell_choice = get_shell_choice()

# Launch the chosen shell
if shell_choice == "red":
    player_position = get_player_position()
    launch_red_shell(player_position)
elif shell_choice == "blue":
    launch_blue_shell()

# Print updated player positions after the shell is launched
print("Updated Player Positions:", players)

```
</details>

<details>
    <summary style="color:green;font-weight:bold">Click here for example code including challenges</summary>
    
>

```
# List of players in the race
players = ["Mario", "Luigi", "Peach", "Yoshi", "Bowser", "Toad"]

def display_player_positions():
    """Display player rankings in a readable format."""
    print("\n🏁 Current Player Positions 🏁")
    for i, player in enumerate(players, start=1):
        position = f"{i}{'st' if i==1 else 'nd' if i==2 else 'rd' if i==3 else 'th'}"
        print(f"{position}: {player}")
    print("-" * 30)  # Separator for readability

def get_shell_choice():
    """Prompt the user to choose a shell and validate input."""
    while True:
        shell = input("Choose a shell to launch (red/blue) or type 'quit' to exit: ").lower()
        if shell in ["red", "blue", "quit"]:
            return shell
        else:
            print("Invalid choice. Please enter 'red', 'blue', or 'quit'.")

def get_player_position():
    """Prompt the user to choose a player position and validate input."""
    while True:
        try:
            player_input = int(input(
                f"Enter the position of the player launching the shell (1-{len(players)}): ")) - 1
            
            if 0 <= player_input < len(players):
                return player_input
            else:
                print(f"Invalid position. Choose a number between 1 and {len(players)}.")
        except ValueError:
            print("Invalid input. Please enter a number.")

def launch_red_shell(player_position):
    """Handle the logic for launching a red shell."""
    if player_position > 0:
        target_red_shell = player_position - 1
        print(f"\n🎯 The Red Shell hits {players[target_red_shell]}!")
        # Move the target player back one position
        players[target_red_shell], players[player_position] = players[player_position], players[target_red_shell]
    else:
        print("\n❌ No player ahead to target!")

def launch_blue_shell():
    """Handle the logic for launching a blue shell."""
    print(f"\n💥 The Blue Shell hits {players[0]}!")
    if len(players) > 4:
        # Move the player in 1st place to 4th place (index 3) if there are more than 4 players
        players.insert(3, players.pop(0))
    else:
        # If there are 4 or fewer players, move the player in 1st place to the last position
        players.append(players.pop(0))

# Main game loop
while True:
    display_player_positions()  # Show updated player positions
    
    shell_choice = get_shell_choice()
    if shell_choice == "quit":
        print("\n🏁 Race Over! Thanks for playing! 🏁")
        break  # Exit the game loop

    if shell_choice == "red":
        player_position = get_player_position()
        launch_red_shell(player_position)
    elif shell_choice == "blue":
        launch_blue_shell()
    
    print("\n🔄 Updating player positions...")

```
</details>
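
The example solutions above read and modify the global `players` list, which makes the shell logic awkward to test on its own. The sketch below is one possible variant, not part of the original solutions: each function takes the list as an argument and returns a new list plus a message. The two-argument signatures and the `drop_to_index` parameter are assumptions made for illustration.

```python
def launch_red_shell(players, player_position):
    """Return (new_order, message) for a red shell fired from a 0-based position."""
    if player_position <= 0:
        # The leader has nobody ahead to hit
        return list(players), "No player ahead to target!"
    target = player_position - 1
    new_order = list(players)  # copy, so the caller's list is left unchanged
    new_order[target], new_order[player_position] = (
        new_order[player_position],
        new_order[target],
    )
    return new_order, f"The Red Shell hits {players[target]}!"


def launch_blue_shell(players, drop_to_index=3):
    """Return (new_order, message) for a blue shell, which hits the leader."""
    new_order = list(players)
    leader = new_order.pop(0)
    # In a short race the leader simply drops to last place
    new_order.insert(min(drop_to_index, len(new_order)), leader)
    return new_order, f"The Blue Shell hits {leader}!"


racers = ["Mario", "Luigi", "Peach", "Yoshi", "Bowser", "Toad"]
racers, message = launch_red_shell(racers, 2)  # Peach (3rd place) fires
print(message)
print(racers)
```

Because nothing here calls `input()` or touches a global, each function can be checked with plain `assert` statements before it is wired into the game loop.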

### __Functions Exercise - Outlier detection__

Outlier detection and removal is crucial in data analysis because:

- __Statistical accuracy:__ Outliers can significantly skew statistical measures like means, variances, and correlations, leading to misrepresentation of the underlying data distribution.

- __Model performance:__ Many machine learning models (linear regression, for example) are sensitive to outliers, which can introduce bias, reduce predictive accuracy, and produce unstable models that don't generalize well to new data.

- __Data quality assurance:__ Identifying outliers helps detect measurement errors, data entry mistakes, or equipment malfunctions that might otherwise go unnoticed.

- __Business insights:__ Removing noise from data enables clearer visualization and more reliable insights for decision-making.

- __Resource efficiency:__ Eliminating outliers can improve computational efficiency by reducing the complexity needed to fit models to unusual data points.

<br>

__For this exercise, create a function that:__

- Takes a numeric list as input.

- Identifies and removes outliers, defined as values more than 3 standard deviations from the mean.

- __Returns:__

    - A cleaned version of the list.

    - The number of outliers found.

- Follows docstring best practices.

__Test it with the following values:__

values = [10, 12, 13, 14, 15, 16, 100, 17, 11, 13, 1, 1, 1, 1]

__Expected outcome:__

[10, 12, 13, 14, 15, 16, 17, 11, 13, 1, 1, 1, 1]

References: 
- https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule
- https://bookdown.org/pkaldunn/Book/identifying-outliers.html

In [None]:
# Exercise: Outlier detection
# Solution:
import numpy as np


[10, 12, 13, 14, 15, 16, 17, 11, 13, 1, 1, 1, 1]

<details>
    <summary style="color:green;font-weight:bold">Click here for the solution</summary>
    
>
```
import numpy as np

def remove_outliers(data, threshold=3):
    """
    Removes outliers from a list of numeric values. Outliers are values more than
    `threshold` standard deviations from the mean.

    Args:
        data: List of numeric values.
        threshold: Number of standard deviations to define an outlier.

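    Returns:
        A tuple (cleaned, n_outliers): the list with outliers removed and
        the number of outliers found.
    """
    np_data = np.array(data)

    if not np.issubdtype(np_data.dtype, np.number):
        raise ValueError("Input list must contain only numeric values.")

    mean = np.mean(np_data)
    std = np.std(np_data)

    lower_bound = mean - threshold * std
    upper_bound = mean + threshold * std

    # Keep only the values inside [lower_bound, upper_bound]
    filtered = np_data[(np_data >= lower_bound) & (np_data <= upper_bound)]

    return filtered.tolist(), len(np_data) - len(filtered)

values = [10, 12, 13, 14, 15, 16, 100, 17, 11, 13, 1, 1, 1, 1]
cleaned_values, n_outliers = remove_outliers(values)
print(cleaned_values)
print(f"Outliers found: {n_outliers}")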
```
</details>
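
To see why only `100` is removed from the test values, it can help to compute the bounds directly. The following is a minimal sketch using the standard library's `statistics` module instead of NumPy, a choice made here only to keep the example dependency-free. `statistics.pstdev` is the population standard deviation, which matches the default behaviour of `np.std`.

```python
import statistics

values = [10, 12, 13, 14, 15, 16, 100, 17, 11, 13, 1, 1, 1, 1]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation

# Anything outside mean +/- 3 standard deviations counts as an outlier
lower_bound = mean - 3 * std
upper_bound = mean + 3 * std

flagged = [v for v in values if not lower_bound <= v <= upper_bound]

print(f"mean = {mean:.2f}, std = {std:.2f}")
print(f"bounds = ({lower_bound:.2f}, {upper_bound:.2f})")
print("flagged as outliers:", flagged)
```

The repeated `1` values sit well below the mean but still inside the lower bound, so they are kept.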

### __Challenge: Create a `DatasetAnalyzer` Class__

Design a class called `DatasetAnalyzer` that takes a Pandas DataFrame as input during initialisation. This class should have methods to:

- `get_column_statistics(self, column_name)`: Returns descriptive statistics for a specified column.  
  This method should handle both numeric and non-numeric columns gracefully (using `pd.api.types`).  
  - For numeric columns, it returns descriptive statistics (mean, std, etc.).  
  - For non-numeric columns, it returns value counts.  
  - If the column doesn't exist, it raises an informative `KeyError`.

- `find_missing_values(self)`: Returns a DataFrame showing the count and percentage of missing values for each column.

- `suggest_columns_for_analysis(self)`: This method analyzes the DataFrame and suggests which columns are suitable for numerical analysis (i.e., numeric data types) and which are not (non-numeric data types). It prints a descriptive summary for each category, explaining the types of analytics that can be performed.

- The class should also validate that the input is a DataFrame and raise an informative error if a requested column does not exist.

Afterwards, you will perform tasks to test your `DatasetAnalyzer` class.
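
Two of the building blocks mentioned above can be tried on their own before writing the class. This is a small sketch, not the solution: the DataFrame and its column names (`age`, `city`) are made up for illustration. It shows `pd.api.types.is_numeric_dtype` separating numeric from non-numeric columns, and one way to get the percentage of missing values per column.

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration
df = pd.DataFrame({"age": [31, None, 27], "city": ["Leeds", "York", "Leeds"]})

# Classify each column by its dtype
column_kinds = {
    col: "numeric" if pd.api.types.is_numeric_dtype(df[col]) else "non-numeric"
    for col in df.columns
}
print(column_kinds)

# isnull() gives a boolean frame; the mean of each column is the fraction missing
missing_percent = df.isnull().mean() * 100
print(missing_percent)
```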

In [None]:
# Challenge: Create a DatasetAnalyzer Class
# Your code:

<details>
    <summary style="color:green;font-weight:bold">Click here for the solution</summary>
    
>
```
import pandas as pd
import numpy as np

class DatasetAnalyzer:
    """
    A utility class for performing basic analysis on a Pandas DataFrame.

    Attributes:
        df (pd.DataFrame): The DataFrame to be analyzed.
    """
    def __init__(self, dataframe):
        """
        Initializes the DatasetAnalyzer with a Pandas DataFrame.

        Args:
            dataframe (pd.DataFrame): The DataFrame to analyze.

        Raises:
            TypeError: If the input is not a Pandas DataFrame.
        """
        if not isinstance(dataframe, pd.DataFrame):
            raise TypeError("Input must be a Pandas DataFrame. Please provide a valid DataFrame.")
        self.df = dataframe
        print("DatasetAnalyzer initialized successfully with the provided DataFrame.")


    def get_column_statistics(self, column_name):
        """
        Returns descriptive statistics or value counts for a specified column.

        For numeric columns, returns descriptive statistics (mean, std, min, max, quartiles).
        For non-numeric (categorical or object) columns, returns value counts.

        Args:
            column_name (str): The name of the column to analyze.

        Returns:
            pd.Series: Descriptive statistics for the column (if numeric),
                       or value counts (if non-numeric).

        Raises:
            KeyError: If the column name is not found in the DataFrame.
        """
        if column_name not in self.df.columns:
            raise KeyError(f"Error: Column '{column_name}' not found in the DataFrame. Please check the column name.")

        col_data = self.df[column_name]

        # Using try-except block to gracefully handle potential issues,
        # though pd.api.types.is_numeric_dtype handles most type checks.
        try:
            if pd.api.types.is_numeric_dtype(col_data):
                print(f"\n--- Statistical Summary for Numeric Column: '{column_name}' ---")
                print("This column can be analyzed using measures like mean, median, standard deviation, min, max, and quartiles.")
                return col_data.describe()
            else:
                print(f"\n--- Value Counts for Categorical/Non-Numeric Column: '{column_name}' ---")
                print("This column is best analyzed by counting the occurrences of each unique value.")
                return col_data.value_counts()
        except Exception as e:
            # Catch any unexpected errors during analysis
            print(f"An unexpected error occurred while analyzing column '{column_name}': {e}")
            return None # Or re-raise specific error if appropriate

    def find_missing_values(self):
        """
        Identifies and quantifies missing values in the DataFrame.

        Returns:
            pd.DataFrame: A DataFrame showing the count and percentage of
                          missing values for each column that has missing data.
                          Returns an empty DataFrame if no missing values are found.
        """
        print("\n--- Missing Values Report ---")
        missing_counts = self.df.isnull().sum()
        missing_percentages = (self.df.isnull().sum() / len(self.df)) * 100
        missing_df = pd.DataFrame({
            'Missing Count': missing_counts,
            'Missing Percentage (%)': missing_percentages
        })
        # Filter to show only columns with missing values
        filtered_missing_df = missing_df[missing_df['Missing Count'] > 0]

        if filtered_missing_df.empty:
            print("No missing values found in the DataFrame.")
        else:
            print("Below are the columns with missing values and their respective counts and percentages:")
        return filtered_missing_df

    def suggest_columns_for_analysis(self):
        """
        Analyzes the DataFrame and provides descriptive suggestions on column types
        and appropriate statistical analyses.
        """
        numeric_cols = []
        non_numeric_cols = []

        for col_name in self.df.columns:
            if pd.api.types.is_numeric_dtype(self.df[col_name]):
                numeric_cols.append(col_name)
            else:
                non_numeric_cols.append(col_name)

        print("\n--- Column Analysis Suggestions ---")
        print("Based on data types, here's how you can typically analyze your columns:\n")

        if numeric_cols:
            print("Numerical Columns (Quantitative Data):")
            print("Can be analyzed using measures like mean, median, mode, standard deviation, variance, min, max, range, and quartiles. Histograms, box plots, and scatter plots are useful visualizations.")
            print("---------------------------------------")
            for col in numeric_cols:
                print(col)
                #print(f"- '{col}': Can be analyzed using measures like mean, median, mode, standard deviation, variance, min, max, range, and quartiles. Histograms, box plots, and scatter plots are useful visualizations.")
            print("\nThese columns are suitable for mathematical operations, statistical modeling, and regression tasks.")
        else:
            print("No numerical columns found in the DataFrame.")

        print("\n") # Add a line break for readability

        if non_numeric_cols:
            print("Categorical/Non-Numerical Columns (Qualitative Data):")
            print("Primarily analyzed by counting frequencies of unique values (value counts), finding modes, and checking for diversity. Bar charts and pie charts are useful visualizations.")
            print("-----------------------------------------------------")
            for col in non_numeric_cols:
                print(col)
            print("\nThese columns are typically used for grouping, filtering, and understanding distributions of categories. They might require encoding for machine learning models.")
        else:
            print("No categorical/non-numerical columns found in the DataFrame.")
```
</details>

#### __Testing__ 

Now test your `DatasetAnalyzer` class: first with a sample DataFrame, then with the Churn dataset.

__Creating a test DataFrame for analysis__

Create a test DataFrame to analyse with your `DatasetAnalyzer` class. It includes numerical columns (quantitative data) with and without missing values, and categorical columns (qualitative data).

In [None]:
# Columns for analysis:
data = {'Numeric_Feature1': [10, 20, np.nan, 40, 50, 60],
        'Numeric_Feature2': [1.5, 2.3, 3.1, 4.0, np.nan, 6.7],
        'Category_A': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Red'],
        'Category_B': ['Small', 'Medium', 'Large', 'Small', 'Medium', 'Medium'],
        'ID_Column': ['id_001', 'id_002', 'id_003', 'id_004', 'id_005', 'id_006'],
        'No_Missing': [1, 2, 3, 4, 5, 6]} # Column with no missing values
test_df = pd.DataFrame(data)
test_df_analyser = DatasetAnalyzer(test_df)
test_df_analyser.suggest_columns_for_analysis()

DatasetAnalyzer initialized successfully with the provided DataFrame.

--- Column Analysis Suggestions ---
Based on data types, here's how you can typically analyze your columns:

Numerical Columns (Quantitative Data):
Can be analyzed using measures like mean, median, mode, standard deviation, variance, min, max, range, and quartiles. Histograms, box plots, and scatter plots are useful visualizations.
---------------------------------------
Numeric_Feature1
Numeric_Feature2
No_Missing

These columns are suitable for mathematical operations, statistical modeling, and regression tasks.


Categorical/Non-Numerical Columns (Qualitative Data):
Primarily analyzed by counting frequencies of unique values (value counts), finding modes, and checking for diversity. Bar charts and pie charts are useful visualizations.
-----------------------------------------------------
Category_A
Category_B
ID_Column

These columns are typically used for grouping, filtering, and understanding distributions of categories. They might require encoding for machine learning models.

__Missing Values__

Create a report about the missing values in your DataFrame.



In [None]:
# Missing values:
test_df_analyser.find_missing_values()


--- Missing Values Report ---
Below are the columns with missing values and their respective counts and percentages:


|                  | Missing Count | Missing Percentage (%) |
|------------------|---------------|------------------------|
| Numeric_Feature1 | 1             | 16.666667              |
| Numeric_Feature2 | 1             | 16.666667              |


<details>
    <summary style="color:green;font-weight:bold">Click here for the solution</summary>
    
>
```
test_df_analyser.find_missing_values()
```
</details>

__Statistics on Numeric Features__

Get a statistical summary of your numerical columns.

In [None]:
# Statistics on Numeric_Feature2:
test_df_analyser.get_column_statistics("Numeric_Feature2")


--- Statistical Summary for Numeric Column: 'Numeric_Feature2' ---
This column can be analyzed using measures like mean, median, standard deviation, min, max, and quartiles.


count    5.000000
mean     3.520000
std      2.005492
min      1.500000
25%      2.300000
50%      3.100000
75%      4.000000
max      6.700000
Name: Numeric_Feature2, dtype: float64

<details>
    <summary style="color:green;font-weight:bold">Click here for the solution</summary>
    
>
```
test_df_analyser.get_column_statistics("Numeric_Feature2")
```
</details>

__Statistics on Categorical Features__

Analyse the categorical columns.

In [None]:
# Statistics on Category_A:
test_df_analyser.get_column_statistics("Category_A")


--- Value Counts for Categorical/Non-Numeric Column: 'Category_A' ---
This column is best analyzed by counting the occurrences of each unique value.


Category_A
Red      3
Blue     2
Green    1
Name: count, dtype: int64

<details>
    <summary style="color:green;font-weight:bold">Click here for the solution</summary>
    
>
```
test_df_analyser.get_column_statistics("Category_A")
```
</details>

__Testing with churn data__

You will now test your class with a real dataset: the Churn data, which you can find in the "Data" folder. First read the dataset, then use your class to analyse it.

In [None]:
# Test your class with a real dataset: 
churn_data = pd.read_csv("Data/WA_Fn-UseC_-Telco-Customer-Churn.csv")
churn_data.head()

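|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|------------|--------|---------------|---------|------------|--------|--------------|---------------|-----------------|----------------|-----|------------------|-------------|-------------|-----------------|----------|------------------|---------------|----------------|--------------|-------|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |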


In [None]:
churn_data_analyser = DatasetAnalyzer(churn_data)

DatasetAnalyzer initialized successfully with the provided DataFrame.


In [None]:
churn_data_analyser.suggest_columns_for_analysis()


--- Column Analysis Suggestions ---
Based on data types, here's how you can typically analyze your columns:

Numerical Columns (Quantitative Data):
Can be analyzed using measures like mean, median, mode, standard deviation, variance, min, max, range, and quartiles. Histograms, box plots, and scatter plots are useful visualizations.
---------------------------------------
SeniorCitizen
tenure
MonthlyCharges

These columns are suitable for mathematical operations, statistical modeling, and regression tasks.


Categorical/Non-Numerical Columns (Qualitative Data):
Primarily analyzed by counting frequencies of unique values (value counts), finding modes, and checking for diversity. Bar charts and pie charts are useful visualizations.
-----------------------------------------------------
customerID
gender
Partner
Dependents
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
StreamingMovies
Contract
PaperlessBilling
PaymentMethod
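TotalCharges
Churn

These columns are typically used for grouping, filtering, and understanding distributions of categories. They might require encoding for machine learning models.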

In [None]:
# Missing values:
churn_data_analyser.find_missing_values()


--- Missing Values Report ---
No missing values found in the DataFrame.


|   | Missing Count | Missing Percentage (%) |
|---|---------------|------------------------|


In [None]:
# Statistics on numerical columns:
churn_data_analyser.get_column_statistics("MonthlyCharges")


--- Statistical Summary for Numeric Column: 'MonthlyCharges' ---
This column can be analyzed using measures like mean, median, standard deviation, min, max, and quartiles.


count    7043.000000
mean       64.761692
std        30.090047
min        18.250000
25%        35.500000
50%        70.350000
75%        89.850000
max       118.750000
Name: MonthlyCharges, dtype: float64

In [None]:
# Statistics on categorical columns:
churn_data_analyser.get_column_statistics("gender")


--- Value Counts for Categorical/Non-Numeric Column: 'gender' ---
This column is best analyzed by counting the occurrences of each unique value.


gender
Male      3555
Female    3488
Name: count, dtype: int64
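
One thing worth checking on real data: a column read from a CSV file can end up with a text dtype even though it holds numbers, for example when some cells contain only whitespace. Such a column is classed as non-numeric by `pd.api.types.is_numeric_dtype`, and its blank cells are not counted by `isnull()`. The sketch below uses a made-up `charges` column to illustrate this; whether it applies to a given column in your dataset is something to verify rather than assume.

```python
import pandas as pd

# Hypothetical column: numbers stored as text, with one blank cell
df = pd.DataFrame({"charges": ["29.85", " ", "108.15"]})

is_numeric_before = pd.api.types.is_numeric_dtype(df["charges"])
missing_before = int(df["charges"].isnull().sum())

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(df["charges"], errors="coerce")

is_numeric_after = pd.api.types.is_numeric_dtype(converted)
missing_after = int(converted.isnull().sum())

print(is_numeric_before, missing_before)
print(is_numeric_after, missing_after)
```

After conversion, the column can be passed to `get_column_statistics` as a numeric column, and its blank cell appears in the missing values report.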

### __Recursion Exercise: Organising sales data__

You're a junior data scientist at a retail company. The company collects sales data from different regions. Each region may have multiple stores, and each store keeps its data in its own CSV file, organised into folders like this:


recursion_exercise_data/

├── region_east/

│   ├── store_001.csv

│   └── store_002.csv

├── region_west/

│   ├── store_003.csv

│   └── subregion/

│       └── store_004.csv

└── store_head_office.csv

Your task is to recursively traverse this structure, load all the CSV files, and combine them into a single DataFrame for analysis.

You’re expected to:

- Find all CSV files under the data directory. Since you don’t know how many subfolders might exist, a recursive approach is ideal.

- Load them as Pandas DataFrames.

- Combine them into one `master_df`.

- Display total sales per store.


Expected output (exact values depend on the sample data you generate):

- DataFrame:

| store_id | sales |     date     |                         source_file                          |
|----------|-------|--------------|--------------------------------------------------------------|
|    1     | 3191  | 2024-01-01   | recursion_exercise_data\region_east\store_001.csv           |
|    1     | 4772  | 2024-01-02   | recursion_exercise_data\region_east\store_001.csv           |
|    1     |  936  | 2024-01-03   | recursion_exercise_data\region_east\store_001.csv           |
|    1     | 3070  | 2024-01-04   | recursion_exercise_data\region_east\store_001.csv           |
|    1     | 1036  | 2024-01-05   | recursion_exercise_data\region_east\store_001.csv           |

- Sales per store: 

| store_id | sales |
|----------|--------|
|    0     | 19255  |
|    4     | 18426  |
|    5     | 17812  |
|    2     | 17278  |
|    6     | 16347  |
|    7     | 14963  |
|    3     | 14118  |
|    1     | 13005  |

First, create some sample sales data: make the folders shown above and save each store's sample data as a CSV file in its folder.

Now, complete the exercise using a recursive approach to create the `master_df` and display the total sales per store.
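
If you are unsure how to create the sample data, the sketch below is one possible approach. The function name `create_sample_sales_data`, the store IDs, the sales range, and the number of days are all assumptions made for illustration; only the folder layout and the `store_id`, `sales`, and `date` columns come from the exercise description.

```python
import random
from pathlib import Path

import pandas as pd


def create_sample_sales_data(root="recursion_exercise_data", days=5, seed=None):
    """Write small random sales CSVs into a nested folder layout under `root`."""
    rng = random.Random(seed)

    # Relative CSV path -> store_id (the IDs are illustrative assumptions)
    layout = {
        "region_east/store_001.csv": 1,
        "region_east/store_002.csv": 2,
        "region_west/store_003.csv": 3,
        "region_west/subregion/store_004.csv": 4,
        "store_head_office.csv": 0,
    }
    dates = pd.date_range("2024-01-01", periods=days).strftime("%Y-%m-%d").tolist()

    written = []
    for relative_path, store_id in layout.items():
        csv_path = Path(root) / relative_path
        # Create the region and subregion folders as needed
        csv_path.parent.mkdir(parents=True, exist_ok=True)
        df = pd.DataFrame({
            "store_id": store_id,
            "sales": [rng.randint(500, 5000) for _ in range(days)],
            "date": dates,
        })
        df.to_csv(csv_path, index=False)
        written.append(csv_path)
    return written
```

Calling `create_sample_sales_data()` once writes the files under `recursion_exercise_data/`. Passing a `seed` makes the random sales figures repeatable.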

In [None]:
# Your answer here:




|   | store_id | sales |    date    |                    source_file                    |
|---|----------|-------|------------|---------------------------------------------------|
| 0 |    1     | 3845  | 2024-01-01 | recursion_exercise_data\region_east\store_001.csv |
| 1 |    1     | 4142  | 2024-01-02 | recursion_exercise_data\region_east\store_001.csv |
| 2 |    1     | 1974  | 2024-01-03 | recursion_exercise_data\region_east\store_001.csv |
| 3 |    1     | 3720  | 2024-01-04 | recursion_exercise_data\region_east\store_001.csv |
| 4 |    1     | 1867  | 2024-01-05 | recursion_exercise_data\region_east\store_001.csv |


In [None]:
# Total sales per store
master_df.groupby("store_id")["sales"].sum().sort_values(ascending=False).reset_index()

|   | store_id | sales |
|---|----------|-------|
| 0 |    6     | 17828 |
| 1 |    7     | 17519 |
| 2 |    1     | 15548 |
| 3 |    4     | 14443 |
| 4 |    5     | 14215 |
| 5 |    0     | 12617 |
| 6 |    2     | 12398 |
| 7 |    3     | 11838 |


<details>
    <summary style="color:green;font-weight:bold">Click here for the solution</summary>
    
>
__Obtaining the `master_df` DataFrame:__

```
import os

import pandas as pd

def load_csvs_recursively(folder_path):
    """
    Recursively finds and loads all CSV files from a folder and its subfolders.

    Args:
        folder_path: The root directory to begin the search.

    Returns:
        A Pandas DataFrame containing the concatenated data from all found CSV files.
    """
    combined_data = []

    for entry in os.scandir(folder_path):
        if entry.is_file() and entry.name.endswith(".csv"):
            df = pd.read_csv(entry.path)
            df['source_file'] = entry.path  # Add file source column
            combined_data.append(df)
        elif entry.is_dir():
            nested_df = load_csvs_recursively(entry.path)
            combined_data.append(nested_df)

    return pd.concat(combined_data, ignore_index=True) if combined_data else pd.DataFrame()

# Load all data
master_df = load_csvs_recursively("recursion_exercise_data") 
master_df.head()
```
<br>

__Sales per store__
```
master_df.groupby("store_id")["sales"].sum().sort_values(ascending=False).reset_index()
```
</details>
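
The recursive pattern in the solution can also be looked at in isolation, without pandas. The sketch below is a simplified variant that only collects file paths, so the recursive case (a folder) and the case that stops the descent (a file) are easy to see. The function names are made up for illustration. The second function uses `pathlib.Path.rglob`, which performs the traversal for you and is a handy way to cross-check the recursive version.

```python
import os
from pathlib import Path


def find_csv_files(folder_path):
    """Recursively collect the paths of all CSV files under folder_path."""
    csv_paths = []
    for entry in os.scandir(folder_path):
        if entry.is_dir():
            # Recursive case: descend into the subfolder and keep what it finds
            csv_paths.extend(find_csv_files(entry.path))
        elif entry.is_file() and entry.name.endswith(".csv"):
            # A CSV file: nothing further to traverse, just record it
            csv_paths.append(entry.path)
    return csv_paths


def find_csv_files_rglob(folder_path):
    """Collect the same paths with pathlib's built-in recursive glob."""
    return [str(p) for p in Path(folder_path).rglob("*.csv") if p.is_file()]
```

Comparing `sorted(find_csv_files(folder))` with `sorted(find_csv_files_rglob(folder))` on your data folder is a quick check that the recursion visits every subfolder.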