# Pandas Basics: Customer Purchasing Behavior Analysis

In this tutorial, we will explore the fundamental operations of the Pandas library using a dataset on customer purchasing behavior.

By the end of this lesson, you will learn how to:
- load data into a DataFrame
- perform basic data exploration
- select and filter data
- execute basic data operations


This hands-on approach will help you gain practical skills in data manipulation and analysis with Pandas.

### What is a DataFrame?

A Pandas DataFrame is a two-dimensional, resizable table with labeled rows and columns. It is similar to a database table or an Excel spreadsheet. Each column in a DataFrame can hold a different data type (e.g., integers, floats, strings), and the DataFrame provides a variety of methods for data manipulation, analysis, and visualization.

Key features of a DataFrame include:
- **Labeled axes**: Both rows and columns have labels, which makes it easy to access and manipulate data.
- **Heterogeneous data**: Different columns can contain different types of data.
- **Size-mutable**: You can add or remove columns and rows.
- **Integrated operations**: Built-in methods for data manipulation, such as filtering, grouping, and aggregation.

For more details, you can refer to the [Pandas documentation on DataFrames](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html).
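The key features listed above can be seen in a few lines. This is a minimal sketch that uses a small, hypothetical `customers` DataFrame (its values and the `is_loyal` column are made up for illustration and are not taken from the lesson's CSV file).

```python
import pandas as pd

# Hypothetical sample data, not taken from customer_data.csv
customers = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "region": ["North", "South", "West"],
        "loyalty_score": [4.5, 7.0, 8.0],
    }
)

# Heterogeneous data: each column keeps its own dtype
print(customers.dtypes)

# Size-mutable: add a column, then drop the row labeled 0
customers["is_loyal"] = customers["loyalty_score"] > 5
customers = customers.drop(index=0)

# Labeled axes: inspect the remaining row labels and the column labels
print(customers.index.tolist())
print(customers.columns.tolist())
```

After the row is dropped, the remaining rows keep their original labels (1 and 2); the labels are not renumbered automatically.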



Before we begin, make sure you have the **Pandas** library installed. **Pandas** is a powerful data manipulation and analysis library for Python.

After you have created and activated your virtual environment (for example, with `venv`), install the Pandas library using `pip`:

```bash
pip install pandas 
```


After you have installed the library, run the block below to import the library.

In [2]:
import pandas as pd

We will first define all of our functions, study them, and then invoke them. The functions are:

- `load_data()`: loads the data from the CSV file into a DataFrame
- `explore_dataframe(df)`: explores the DataFrame
- `data_selection(df)`: selects and filters the data from the DataFrame
- `data_operations(df)`: performs basic data operations on the DataFrame

**Note:** Try to read and interpret the code for each function *before* you read the explanation.

Before you start, let's take a look at the [data](./data/customer_data.csv) we will be working with in the `load_data()` function.

Let's start with the `load_data()` function. What is it doing? After you have discussed the code, run the code block below to create the function. Running the cell produces no output, only a checkmark; the function definition is kept in the `pandas_basics_lesson.ipynb` notebook session for later invocations.


In [3]:
def load_data():
    return pd.read_csv('data/customer_data.csv')

### Function Explanation: `load_data()`

The `load_data()` function reads a CSV file named `customer_data.csv` from the `data` directory (the path is relative to the current working directory) and returns its contents as a Pandas DataFrame. This DataFrame contains all the customer data from the CSV file, ready for further analysis and manipulation using Pandas.
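To see what `pd.read_csv` does without depending on a file on disk, the sketch below parses a small CSV held in a string. The `csv_text` contents and the `sample_df` name are hypothetical; `read_csv` accepts a file-like object such as `io.StringIO` as well as a path.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """user_id,age,region
1,25,North
2,34,South
"""

# The first line becomes the column labels; each later line becomes a row
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)             # (2, 3)
print(sample_df.columns.tolist())  # ['user_id', 'age', 'region']
```

Numeric-looking columns such as `age` are parsed as numbers, so they can be used in calculations right away.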

Let's create the `explore_dataframe()` function.

In [4]:
def explore_dataframe(df):
    """Demonstrate basic DataFrame operations."""
    print("First 5 rows:")
    print(df.head())
    
    print("\nDataFrame Info:")
    df.info()  # info() prints its summary directly and returns None
    
    print("\nSummary Statistics:")
    print(df.describe())

### Function Explanation: `explore_dataframe(df)`

The `explore_dataframe(df)` function demonstrates basic DataFrame operations using the following methods:

- `df.head()`: This method returns the first 5 rows of the DataFrame by default. You can specify a different number of rows by passing an argument, e.g., `df.head(10)` to return the first 10 rows. It is useful for quickly inspecting the initial entries of the dataset.

- `df.info()`: Prints a concise summary of the DataFrame, including the index dtype, the column dtypes, the non-null count for each column, and memory usage. It prints the summary itself and returns `None`.

- `df.describe()`: Generates descriptive summary statistics for numerical columns by default: count, mean (average), std (standard deviation), min, 25%, 50%, 75%, and max.

These methods help in understanding the structure and properties of the DataFrame, making it easier to perform subsequent data analysis tasks.
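The three methods can be tried on a tiny example. This is a sketch using a hypothetical `sample_df` with one missing value, chosen so that the effect on the non-null count is visible.

```python
import pandas as pd

# Hypothetical data with one missing purchase_amount
sample_df = pd.DataFrame(
    {
        "age": [25, 34, 45, 22],
        "purchase_amount": [200.0, 350.0, None, 150.0],
    }
)

first_two = sample_df.head(2)   # first 2 rows instead of the default 5
info_result = sample_df.info()  # prints the summary itself and returns None
stats = sample_df.describe()    # one column of statistics per numeric column

print(first_two)
print(info_result)              # None
print(stats.loc["count"])       # the count excludes the missing value
```

Because `info()` returns `None`, wrapping it in `print()` adds an extra `None` line after the summary.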

Let's create the `data_selection(df)` function.


### Overview: Selecting Columns from a DataFrame

Selecting specific columns from a DataFrame is a common operation in data analysis. It allows you to focus on the relevant parts of your dataset, making it easier to perform analysis, visualization, and reporting. By selecting only the necessary columns, you can reduce memory usage and improve the performance of your data processing tasks. Additionally, selecting columns can help in cleaning and preparing data by isolating the features of interest.
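One detail worth knowing when selecting columns is the difference between single and double brackets. The sketch below uses a hypothetical `sample_df`; it is an illustration, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame(
    {
        "age": [25, 34],
        "annual_income": [45000, 55000],
        "region": ["North", "South"],
    }
)

ages = sample_df["age"]                       # one column name -> Series
subset = sample_df[["age", "annual_income"]]  # list of names -> DataFrame

print(type(ages).__name__)    # Series
print(type(subset).__name__)  # DataFrame
```

Passing a list of names always returns a DataFrame, even when the list contains a single column.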



In [5]:
def data_selection(df):
    """Demonstrate data selection and indexing."""
    print("Selecting 'age' and 'annual_income' columns:")
    print(df[['age', 'annual_income']].head())
    
    print("\nFiltering customers with loyalty_score > 75:")
    print(df[df['loyalty_score'] > 75].head())
    
    print("\nUsing loc to select specific rows and columns:")
    print(df.loc[0:4, ['user_id', 'age', 'purchase_amount']])

### Function Explanation: `data_selection(df)`

The `data_selection(df)` function demonstrates various techniques for selecting and filtering data within a DataFrame. Here is a breakdown of what each part of the function does:

- `df[['age', 'annual_income']]`: This line selects the 'age' and 'annual_income' columns from the DataFrame and returns a new DataFrame containing only these columns. The `head()` method is then used to display the first 5 rows of this new DataFrame.

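- `df[df['loyalty_score'] > 75]`: This line filters the DataFrame to include only the rows where the 'loyalty_score' column value is greater than 75. The `head()` method is used again to display the first 5 rows of this filtered DataFrame. Note that the loyalty scores visible in the output shown later in this lesson are small values such as 4.5 and 9.5, so this threshold matches no rows and the result is an empty DataFrame; a lower threshold (for example, 7.5) would be needed to return any customers.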

- `df.loc[0:4, ['user_id', 'age', 'purchase_amount']]`: This line uses the `loc` indexer to select the rows labeled 0 through 4 and the specified columns from the DataFrame. The `loc` indexer selects rows and columns by label, and unlike ordinary Python slicing, a `loc` slice includes both endpoints, so five rows are returned. No `head()` call is needed here because the selection is already limited to those rows.

How does `loc` work? Do some research on the internet, in the documentation or using AI.

These examples showcase how to select specific columns, filter rows based on conditions, and use label-based indexing to extract desired subsets of data from a DataFrame.
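As a starting point for exploring `loc`, the sketch below contrasts it with the position-based `iloc` indexer and shows a boolean filter. The `sample_df` values and the 7.5 threshold are hypothetical, chosen only so the filter returns some rows.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5, 6],
        "loyalty_score": [4.5, 7.0, 8.0, 3.0, 4.8, 9.5],
    }
)

# loc slices by label and includes both endpoints
by_label = sample_df.loc[0:4, ["user_id"]]

# iloc slices by position and excludes the stop position
by_position = sample_df.iloc[0:4, [0]]

# A comparison yields a boolean Series; indexing with it keeps the True rows
mask = sample_df["loyalty_score"] > 7.5
high_loyalty = sample_df[mask]

print(len(by_label), len(by_position))   # 5 4
print(high_loyalty["user_id"].tolist())  # [3, 6]
```

With the default integer index, labels and positions happen to coincide, which is why the endpoint behavior is the easiest way to tell `loc` and `iloc` apart.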

Let's create the `data_operations(df)` function.

In [6]:
def data_operations(df):
    """Demonstrate basic data operations."""
    df['average_purchase'] = df['purchase_amount'] / df['purchase_frequency']
    print("Added 'average_purchase' column:")
    print(df.head())
    
    print("\nSorting by purchase_amount (descending):")
    print(df.sort_values('purchase_amount', ascending=False).head())

### Function Explanation: `data_operations(df)`

The `data_operations(df)` function demonstrates basic data operations on a DataFrame. Here is a breakdown of what each part of the function does:

- `df['average_purchase'] = df['purchase_amount'] / df['purchase_frequency']`: This line creates a new column named 'average_purchase' in the DataFrame. The values in this new column are calculated row by row by dividing the 'purchase_amount' column by the 'purchase_frequency' column. Because the assignment modifies the DataFrame that was passed in, the caller's `df` also gains this column.

- `df.sort_values('purchase_amount', ascending=False)`: This line returns a new DataFrame sorted by the 'purchase_amount' column in descending order; the original `df` keeps its row order. The `head()` method is used to display the first 5 rows of the sorted DataFrame.

These operations help in performing calculations, sorting data, and adding new columns to a DataFrame, which are fundamental tasks in data analysis and manipulation.
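If you would rather not modify the original DataFrame, one alternative is `DataFrame.assign`, which returns a new DataFrame with the extra column. The sketch below is an illustration on a hypothetical `sample_df`, not the approach used by `data_operations(df)`.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame(
    {
        "purchase_amount": [200, 350, 500],
        "purchase_frequency": [12, 18, 22],
    }
)

# assign() returns a new DataFrame with the extra column; sample_df is untouched
with_average = sample_df.assign(
    average_purchase=sample_df["purchase_amount"] / sample_df["purchase_frequency"]
)

# sort_values() also returns a new DataFrame; with_average keeps its row order
sorted_df = with_average.sort_values("purchase_amount", ascending=False)

print("average_purchase" in sample_df.columns)  # False
print(sorted_df.index.tolist())                 # [2, 1, 0]
print(with_average.index.tolist())              # [0, 1, 2]
```

The sorted result keeps the original row labels, which is why its index reads 2, 1, 0 rather than being renumbered.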




Now it's time to invoke the functions we've created. First, let's invoke the `load_data()` function and assign its result to a variable named `df`.

In [7]:
df = load_data()

Next, let's invoke the `explore_dataframe()` function with `df` as the argument. Be sure to study the data.

In [8]:
print("\n1. Exploring DataFrame")
explore_dataframe(df)


1. Exploring DataFrame
First 5 rows:
   user_id  age  annual_income  purchase_amount  loyalty_score region  \
0        1   25          45000              200            4.5  North   
1        2   34          55000              350            7.0  South   
2        3   45          65000              500            8.0   West   
3        4   22          30000              150            3.0   East   
4        5   29          47000              220            4.8  North   

   purchase_frequency  
0                  12  
1                  18  
2                  22  
3                  10  
4                  13  

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238 entries, 0 to 237
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   user_id             238 non-null    int64  
 1   age                 238 non-null    int64  
 2   annual_income       238 non-null    int64  
 3   purchase

Next, let's invoke the `data_selection()` function with `df` as the argument. Take a look at the output.

In [9]:
print("\n2. Data Selection and Indexing")
data_selection(df)


2. Data Selection and Indexing
Selecting 'age' and 'annual_income' columns:
   age  annual_income
0   25          45000
1   34          55000
2   45          65000
3   22          30000
4   29          47000

Filtering customers with loyalty_score > 75:
Empty DataFrame
Columns: [user_id, age, annual_income, purchase_amount, loyalty_score, region, purchase_frequency]
Index: []

Using loc to select specific rows and columns:
   user_id  age  purchase_amount
0        1   25              200
1        2   34              350
2        3   45              500
3        4   22              150
4        5   29              220


Finally, let's invoke the `data_operations()` function with `df` as the argument. Take a look at the output.

In [10]:
print("\n3. Basic Data Operations")
data_operations(df)


3. Basic Data Operations
Added 'average_purchase' column:
   user_id  age  annual_income  purchase_amount  loyalty_score region  \
0        1   25          45000              200            4.5  North   
1        2   34          55000              350            7.0  South   
2        3   45          65000              500            8.0   West   
3        4   22          30000              150            3.0   East   
4        5   29          47000              220            4.8  North   

   purchase_frequency  average_purchase  
0                  12         16.666667  
1                  18         19.444444  
2                  22         22.727273  
3                  10         15.000000  
4                  13         16.923077  

Sorting by purchase_amount (descending):
     user_id  age  annual_income  purchase_amount  loyalty_score region  \
119      120   55          75000              640            9.5   West   
179      180   55          75000              640         

**Congratulations!** You have successfully completed the Pandas Basics tutorial. Now it's time to test your skills and apply your knowledge with a challenge. Navigate to the [Employee Performance Challenge](./employee_performance_challenge.md) file and follow the instructions.