# Pandas Basics: Customer Purchasing Behavior Analysis

In this tutorial, we will explore the fundamental operations and functionalities of the Pandas library using a dataset on customer purchasing behavior. 

By the end of this lesson, you will learn how to:
- load data into a DataFrame
- perform basic data exploration
- select and filter data
- execute basic data operations


This hands-on approach will help you gain practical skills in data manipulation and analysis with Pandas.

### What is a DataFrame?

A Pandas DataFrame is a two-dimensional table with labeled rows and columns. It can hold different types of data and can be resized. It is similar to a database table or an Excel spreadsheet. Each column in a DataFrame can contain a different data type (e.g., integers, floats, strings), and the DataFrame provides a variety of methods for data manipulation, analysis, and visualization.

Key features of a DataFrame include:
- **Labeled axes**: Both rows and columns have labels, which makes it easy to access and manipulate data.
- **Heterogeneous data**: Different columns can contain different types of data.
- **Size-mutable**: You can add or remove columns and rows.
- **Integrated operations**: Built-in methods for data manipulation, such as filtering, grouping, and aggregation.

For more details, you can refer to the [Pandas documentation on DataFrames](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html).
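To make the key features above concrete, here is a minimal sketch that builds a small DataFrame by hand. The values and the `region` column are invented for illustration and are not taken from `customer_data.csv`.

```python
import pandas as pd

# Toy data invented for illustration; not taken from customer_data.csv.
toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "age": [25, 34, 45],
        "region": ["North", "South", "West"],  # hypothetical column
        "loyalty_score": [4.5, 7.0, 8.2],
    }
)

# Labeled axes: columns have names, rows have an index (0, 1, 2 by default).
print(toy.columns.tolist())
print(toy.index.tolist())

# Heterogeneous data: each column keeps its own dtype.
print(toy.dtypes)

# Size-mutable: add a column, then drop a row.
toy["is_senior"] = toy["age"] > 40
smaller = toy.drop(index=0)
print(smaller.shape)  # (2, 5)
```

Note that `drop` returns a new DataFrame, so `toy` itself still has all three rows.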



## Turn off Cursor Predictor

Before we begin, turn off the Cursor Predictor in your workspace. 

In the bottom right corner of the screen, hover over the `Cursor Tab` and, in the box that appears, check `Disable Globally`. Also choose `disabled` in the `Cursor Prediction` dropdown. This lets you write the code on your own without the help of Cursor's AI. Creating the functions yourself will help you learn the material.

### Install Virtual Environment

If you do not have a virtual environment, create one and then activate it using the following commands:

```bash
python3 -m venv venv
```

```bash
source venv/bin/activate
```

### Select Kernel

In the top right corner of the screen, click on the `Select Kernel` tab and select `Python Kernels` then choose the `Recommended` kernel which should be the `venv(Python 3.x.x)` kernel. This will enable you to use the Pandas library in your workspace.

When you run your first code block, remember to click `Install` in the `ipykernel` popup.

### Install Pandas

Use the following command in your terminal to install the **Pandas** library:

```bash
pip install pandas 
```


After you have installed the library, run the block below to import the library into your workspace.

In [1]:
import pandas as pd

### Data Research

All of our functions will work with the `customer_data.csv` file. It is always a good idea to look at your dataset before you start working with it, so let's take some time to research the data and understand it.

Sometimes a dataset comes with a description of what each column represents. This is called the **[Dataset Dictionary](./data/dataset-description.md)**.

Next, take a look at the data in the actual CSV file: **[Customer Data CSV LINK](./data/customer_data.csv)**.


### Create Functions

Now that we have a better understanding of the data, let's create our functions which will utilize the methods from the Pandas library.

Our functions will be:

- load_data() - loads the data from the CSV file into a DataFrame
- explore_dataframe(df) - explores the DataFrame
- data_selection(df) - selects and filters the data from the DataFrame
- data_operations(df) - performs basic data operations on the DataFrame

**Note:** Try to read and interpret the code for each function *before* you read the explanation.


### Load Data

Now that we have an understanding of the data, let's start with the `load_data()` function. What is it doing? After you have discussed the code, run the code block below to create the function. You will only see a checkmark when the cell finishes, but the function declaration will be stored in the `pandas_basics_lesson.ipynb` workspace for later invocations.


In [None]:
def load_data():
    return pd.read_csv('data/customer_data.csv')

Now let's invoke the function and store the output in a variable `df`.

In [None]:
df = load_data()


#### Function Explanation: `load_data()`

The `load_data()` function reads a CSV file named `customer_data.csv` from the `data` directory and loads it into a Pandas DataFrame. This DataFrame will contain all the customer data from the CSV file, making it ready for further analysis and manipulation using Pandas.

We've placed this DataFrame in the variable `df`.
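If you want to see what `pd.read_csv` does without opening the real file, the sketch below reads a few CSV lines from an in-memory string. The rows are invented for illustration; only the column names `user_id`, `age`, and `purchase_amount` come from this lesson.

```python
import io
import pandas as pd

# A tiny stand-in for the CSV file; the rows are invented for illustration.
csv_text = """user_id,age,purchase_amount
1,25,200
2,34,350
3,45,500
"""

# read_csv accepts a file path or any file-like object.
# The first line becomes the column names; each later line becomes a row.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (3, 3): three rows, three columns
print(sample.columns.tolist())  # ['user_id', 'age', 'purchase_amount']
```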



### Explore Data

Let's create the `explore_dataframe()` function.

This function will demonstrate basic DataFrame operations using common methods:

In [None]:
def explore_dataframe(df):
    print("First 5 rows:")
    print(df.head())
    
    print("\nDataFrame Info:")
    df.info()  # info() prints its report itself and returns None
    
    print("\nSummary Statistics:")
    print(df.describe())


Let's run the `explore_dataframe(df)` function.

In [None]:
print("\n1. Exploring the DataFrame")
explore_dataframe(df)

#### Function Explanation: `explore_dataframe(df)`

The `explore_dataframe(df)` function demonstrates basic DataFrame operations using the following methods:

- `df.head()`: This method returns the first 5 rows of the DataFrame by default. You can specify a different number of rows by passing an argument, e.g., `df.head(10)` to return the first 10 rows. It is useful for quickly inspecting the initial entries of the dataset.

- `df.info()`: Prints a concise summary of the DataFrame, including the index dtype, column dtypes, non-null counts, and memory usage.

- `df.describe()`: Generates descriptive [summary statistics](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html) for numerical columns, such as count, mean (average), std (standard deviation), min, 25%, 50%, 75%, and max.

These methods help in understanding the structure and properties of the DataFrame, making it easier to perform subsequent data analysis tasks.
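The same three methods can be tried on a tiny hand-made DataFrame, where the results are easy to check by eye. This is a sketch with invented values, not output from `customer_data.csv`.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame(
    {
        "age": [20, 30, 40, 50],
        "purchase_amount": [100.0, 200.0, 300.0, 400.0],
    }
)

print(toy.head(2))   # first 2 rows instead of the default 5
print(toy.shape)     # (4, 2)

# info() prints its report directly and returns None,
# so it does not need to be wrapped in print().
result = toy.info()
print(result is None)  # True

# describe() returns a DataFrame, so single statistics can be looked up.
stats = toy.describe()
print(stats.loc["mean", "age"])  # 35.0
print(stats.loc["50%", "age"])   # 35.0 (the median)
```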



### Selecting Columns from a DataFrame

Selecting specific columns from a DataFrame is a common operation in data analysis. It allows you to focus on the relevant parts of your dataset, making it easier to perform analysis, visualization, and reporting. By selecting only the necessary columns, you can reduce memory usage and improve the performance of your data processing tasks. Additionally, selecting columns can help in cleaning and preparing data by isolating the features of interest. 

Let's create the `data_selection(df)` function.

In [None]:
def data_selection(df):
    print("Selecting 'age' and 'annual_income' columns:")
    print(df[['age', 'annual_income']].head())
    
    print("\nFiltering customers with loyalty_score > 4.5:")
    print(df[df['loyalty_score'] > 4.5].head())
    
    print("\nUsing loc to select specific rows and columns:")
    print(df.loc[0:4, ['user_id', 'age', 'purchase_amount']])


Let's run the `data_selection(df)` function.

In [None]:
print("\n2. Data Selection and Indexing")
data_selection(df)

#### Function Explanation: `data_selection(df)`

The `data_selection(df)` function demonstrates various techniques for selecting and filtering data within a DataFrame. Here is a breakdown of what each part of the function does:

- `df[['age', 'annual_income']]`: This line selects the 'age' and 'annual_income' columns from the DataFrame and returns a new DataFrame containing only these columns. The `head()` method is then used to display the first 5 rows of this new DataFrame.

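- `df[df['loyalty_score'] > 4.5]`: This line filters the DataFrame to include only the rows where the 'loyalty_score' column value is greater than 4.5. The inner expression `df['loyalty_score'] > 4.5` produces a Series of True/False values, and indexing with it keeps the rows marked True. The `head()` method is used again to display the first 5 rows of this filtered DataFrame.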

- `df.loc[0:4, ['user_id', 'age', 'purchase_amount']]`: This line uses the `loc` indexer to select the rows labeled 0 through 4 and the specified columns from the DataFrame. The `loc` indexer selects rows and columns by label, and unlike ordinary Python slicing, the end of a label slice is included, so this returns 5 rows. No `head()` call is needed here because the selection is already limited to those rows.

How does `loc` work? Do some research on the internet, in the documentation or using AI.

These examples showcase how to select specific columns, filter rows based on conditions, and use label-based indexing to extract desired subsets of data from a DataFrame.
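As a starting point for the `loc` question above, the sketch below contrasts `loc` (label-based) with `iloc` (position-based) and shows how to combine two filter conditions. The values are invented, and the index is set to letters so that labels and positions are easy to tell apart.

```python
import pandas as pd

# Invented values; the index is set to letters to make labels visible.
toy = pd.DataFrame(
    {
        "age": [25, 34, 45, 52],
        "loyalty_score": [3.0, 5.5, 4.8, 9.1],
    },
    index=["a", "b", "c", "d"],
)

# loc selects by label, and the end of a label slice is included.
by_label = toy.loc["a":"c", ["age"]]
print(len(by_label))     # 3 rows: a, b, c

# iloc selects by integer position, and the end of the slice is excluded.
by_position = toy.iloc[0:2]
print(len(by_position))  # 2 rows: positions 0 and 1

# A boolean mask keeps the rows where the condition is True.
# Combine conditions with & (and) or | (or), each wrapped in parentheses.
mask = (toy["loyalty_score"] > 4.5) & (toy["age"] < 50)
print(toy.loc[mask].index.tolist())  # ['b', 'c']
```

With the default index produced by `pd.read_csv` (0, 1, 2, ...), labels and positions look the same, which is why `df.loc[0:4]` returns five rows while `df.iloc[0:4]` returns four.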

### Basic Data Operations

Basic data operations involve performing calculations, sorting data, and adding new columns to a DataFrame. These operations are fundamental to data analysis and manipulation.

Let's create the `data_operations(df)` function.


In [None]:
def data_operations(df):
    df['average_purchase'] = df['purchase_amount'] / df['purchase_frequency']
    print("Added 'average_purchase' column:")
    print(df.head())
    
    print("\nSorting by purchase_amount (descending):")
    print(df.sort_values('purchase_amount', ascending=False).head())


Let's run the `data_operations(df)` function.


In [None]:
print("\n3. Basic Data Operations")
data_operations(df)

#### Function Explanation: `data_operations(df)`

The `data_operations(df)` function demonstrates basic data operations on a DataFrame. Here is a breakdown of what each part of the function does:

- `df['average_purchase'] = df['purchase_amount'] / df['purchase_frequency']`: This line creates a new column named 'average_purchase' in the DataFrame. The values in this new column are calculated by dividing the 'purchase_amount' column by the 'purchase_frequency' column.

- `df.sort_values('purchase_amount', ascending=False)`: This line sorts the DataFrame by the 'purchase_amount' column in descending order. The `head()` method is used to display the first 5 rows of the sorted DataFrame.

These operations help in performing calculations, sorting data, and adding new columns to a DataFrame, which are fundamental tasks in data analysis and manipulation.
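Two details of these operations are easy to miss: column arithmetic works row by row, and `sort_values` returns a new DataFrame rather than reordering the original. The sketch below uses invented values, including a `purchase_frequency` of 0 to show what division does in that case. Whether `customer_data.csv` actually contains zeros is not stated in this lesson.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame(
    {
        "purchase_amount": [200, 500, 300],
        "purchase_frequency": [4, 10, 0],
    }
)

# Element-wise division: each row is divided independently.
# A frequency of 0 does not raise an error; the result is inf.
toy["average_purchase"] = toy["purchase_amount"] / toy["purchase_frequency"]
print(toy["average_purchase"].tolist())    # [50.0, 50.0, inf]

# sort_values returns a new DataFrame; the original row order is unchanged.
ranked = toy.sort_values("purchase_amount", ascending=False)
print(ranked["purchase_amount"].tolist())  # [500, 300, 200]
print(toy["purchase_amount"].tolist())     # [200, 500, 300]
```

Assigning to `toy["average_purchase"]` does modify `toy` in place, which is why the new column is still present when `df` is used after calling `data_operations(df)`.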


### Visualize Data

Finally, we can visualize the data using the `matplotlib` library. While this library is not part of Pandas, it is a powerful tool for data visualization and works well with Pandas DataFrames. The `seaborn` library is built on top of `matplotlib` and provides a high-level interface for creating more complex and informative visualizations.

Let's first install the `matplotlib` and `seaborn` libraries.

In [None]:
%pip install matplotlib seaborn

    

Next, let's import `matplotlib` and `seaborn` and create the `visualize_data(df)` function.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_data(df):
    # Histogram of purchase amounts with a smoothed density (KDE) curve
    sns.histplot(df['purchase_amount'], bins=30, kde=True)
    plt.title('Distribution of Purchase Amount')
    plt.xlabel('Purchase Amount')
    plt.show()  # render the figure




Let's run the `visualize_data(df)` function.

In [None]:
print("\n4. Visualizing Data")
visualize_data(df)

#### Function Explanation: `visualize_data(df)`

The `visualize_data(df)` function demonstrates how to visualize data using the `matplotlib` and `seaborn` libraries. Here is a breakdown of what each part of the function does:

- `import matplotlib.pyplot as plt` and `import seaborn as sns`: These lines import the necessary libraries for data visualization.

- `sns.histplot(df['purchase_amount'], bins=30, kde=True)`: This line creates a histogram of the 'purchase_amount' column. (A histogram is a graphical representation of the distribution of a numerical variable. It divides the variable's range into intervals called bins and shows how many values fall into each one.) The `bins` parameter specifies the number of bins in the histogram, and `kde=True` overlays a kernel density estimate curve, which gives a smoothed view of the same distribution.

- `plt.title('Distribution of Purchase Amount')`: This line sets the title of the plot to 'Distribution of Purchase Amount'.

- `plt.xlabel('Purchase Amount')`: This line labels the x-axis 'Purchase Amount'.
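To see what the bars of a histogram count, the sketch below performs the binning step numerically with `pd.cut`, without drawing anything. It only approximates what a plotting library does internally, and the values are invented for illustration.

```python
import pandas as pd

# Invented values for illustration only.
amounts = pd.Series([10, 12, 15, 40, 45, 90], name="purchase_amount")

# Split the value range into 3 equal-width bins and count values per bin.
# Each count corresponds to the height of one histogram bar.
binned = pd.cut(amounts, bins=3)
counts = binned.value_counts().sort_index()
print(counts.tolist())  # [3, 2, 1]
```

Increasing `bins` produces narrower intervals and a more detailed, but noisier, picture of the distribution.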



If you would like to see and experiment with the full code in a .py file, you can find it [here](./pandas_basics_full_code.py).

Also, be sure to check out the Resources listed in the [README.md](./README.md) file for more information on the methods and functions used in this lesson.

And if you are ready for a challenge, check out the [5.1 Pandas Practice](https://github.com/jdrichards-pursuit/week-5.1-python-practice).