# Python Basics 10
## Pandas
***
This notebook covers:
- The Pandas DataFrame data format
- Reading and creating DataFrames
- Statistical methods with DataFrames
***

## Introduction

 The `pandas` module was developed to equip `Python` with the most important functions for handling and examining large datasets. <br>

 In `pandas`, the **`DataFrame`** class is introduced, which serves as an array-like data structure and goes beyond the capabilities of `numpy` arrays to enable more advanced data manipulation and exploration. <br>

 The main functions of `pandas` include: <br>
 * Reading data from various file formats (CSV, Excel spreadsheets, etc.). <br>
 * Managing this data (deleting, adding, or modifying values, computing statistics, visualizing, etc.). <br>

 This lesson has the following learning objectives: <br>
 * Understanding the structure of a `DataFrame`. <br>
 * Creating a first `DataFrame`. <br>
 * Performing a first exploration of a dataset with the `DataFrame` class. <br>

#### Exercises:
> (a) Import the `pandas` module under the name `pd`. <br>

In [2]:
# Your Solution:




#### Solution:

In [3]:
import pandas as pd

## 1. Structure of a DataFrame

 A `DataFrame` has the form of a **matrix** with unique row and column indices. In general, the columns are labeled with names, while the rows have unique identifiers. <br>

 A `DataFrame` resembles a table in a **database**. The individual **records** of the table (such as people, animals, objects, etc.) correspond to the **rows**, and their **properties** form the **columns**: <br>

  |  |  Name   |  Gender |  Height |Age|
  |--|--------|-------|--------|---|
  |**0** |  Robert|   M   |   174  |23 |
  |**1** |  Mark  |   M   |   182  |40 |
  |**2** |  Aline |   F   |   169  |56 |

 * The above `DataFrame` summarizes information about **3 people**: the `DataFrame` therefore has **3 rows**. <br>
 * For each of these people there are **4 variables** (Name, Gender, Height and Age): the `DataFrame` therefore has **4 columns**. <br>

 The column with the **row numbering** is called the **index** and is managed differently than the other columns of the `DataFrame`. The index can be set by default (follows row numbering), defined by one (or more) columns of the `DataFrame`, or even set with a list we specify. <br>

 * **Example:** Default indexing (row numbering), nothing needs to be specified here: <br>


  |  |  Name   |  Gender |  Height |Age|
  |--|--------|-------|--------|---|
  |**0** |  Robert|   M   |   174  |23 |
  |**1** |  Mark  |   M   |   182  |40 |
  |**2** |  Aline |   F   |   169  |56 |

 * **Example:** Indexing by the `'Name'` column: <br>


  |        |  Gender |  Height |Age|
  |--------|-------|--------|---|
  |**Robert** |  M   |   174  |23 |
  |**Mark** |  M   |   182  |40 |
  |**Aline** | F   |   169  |56 |

 * **Example:** Indexing with the list `['person_1', 'person_2', 'person_3']`: <br>


  |  |  Name   |  Gender |  Height |Age|
  |--|--------|-------|--------|---|
  |**person_1** |  Robert|   M   |   174  |23 |
  |**person_2** |  Mark  |   M   |   182  |40 |
  |**person_3** |  Aline |   F   |   169  |56 |

 How to define the index when creating a `DataFrame` will be discussed in detail later. <br>
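
 As a brief preview, the three indexing variants above could be produced as in the following sketch. The `people` dictionary and the variable names are illustrative assumptions that mirror the example table; `set_index` is the standard `pandas` method for promoting a column to the index. <br>

```python
import pandas as pd

# Hypothetical data mirroring the example table above
people = {'Name': ['Robert', 'Mark', 'Aline'],
          'Gender': ['M', 'M', 'F'],
          'Height': [174, 182, 169],
          'Age': [23, 40, 56]}

# 1. Default index (row numbering): nothing needs to be specified
df_default = pd.DataFrame(people)

# 2. Index defined by the 'Name' column
df_by_name = df_default.set_index('Name')

# 3. Index set with a list we specify
df_custom = pd.DataFrame(people, index=['person_1', 'person_2', 'person_3'])

print(df_default.index.tolist())  # [0, 1, 2]
print(df_by_name.index.tolist())  # ['Robert', 'Mark', 'Aline']
print(df_custom.index.tolist())   # ['person_1', 'person_2', 'person_3']
```

 Note that once `'Name'` becomes the index, it is no longer one of the regular columns of `df_by_name`. <br>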

 The `DataFrame` class has several advantages over a `numpy` array: <br>

 * Visually, a `DataFrame` is much more **readable** thanks to its explicit column names and row labels. <br>
 * While elements within a column are of the same type, the **types of elements** can vary from column to column, a capability that is not present in `numpy` arrays since they only support data of the same type. <br>
 * The `DataFrame` class offers a greater selection of methods for handling and preprocessing database objects (e.g., tables), while `numpy` specializes in optimized calculations. <br>
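
 The second point can be illustrated with a short sketch. The data below is made up for illustration: the `DataFrame` keeps a separate type per column, whereas the `numpy` array converts every element to one common type. <br>

```python
import numpy as np
import pandas as pd

# A DataFrame keeps one data type per column
mixed = pd.DataFrame({'Name': ['Robert', 'Mark'],
                      'Height': [174, 182]})
print(mixed.dtypes)  # 'Name' holds text, 'Height' holds integers

# A NumPy array has a single data type for all of its elements:
# here the integers are converted to strings
arr = np.array([['Robert', 174],
                ['Mark', 182]])
print(arr.dtype)     # one string type for the whole array
```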



## 2. Creating a DataFrame from a NumPy Array

 It is possible to directly create a `DataFrame` from a `numpy` array with the `DataFrame()` constructor. However, this is not very practical since the data types for all columns must be the same. <br>

 Let's take a closer look at the header of this constructor. <br>

```python
pd.DataFrame(data, index, columns, ...)
```

 * The parameter `data` contains the **data** to be formatted (`numpy` array, list, dictionary, or another `DataFrame`). <br>
 * The parameter `index` must, if specified, be a **list** with the **indices of the entries**. <br>
 * The parameter `columns` must, if specified, be a **list** with the **names of the columns**. <br>

 For additional parameters, you can consult the `pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). <br>

 * **Example:** <br>

```python
import numpy as np
import pandas as pd

# Creating a NumPy array with 3 rows and 4 columns
array = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12]])

# Instantiating a DataFrame
df = pd.DataFrame(data=array,  # The data to be formatted
                 index=['i_1', 'i_2', 'i_3'],  # The indices of the entries
                 columns=['A', 'B', 'C', 'D'])  # The names of the columns
```

 This creates the following `DataFrame`: <br>

  | | A | B | C | D |
  | --------- | --- | ---- | ---- | ---- |
  |**i_1**| 1 |2| 3 |4|
  |**i_2**| 5 | 6 | 7 | 8 |
  |**i_3**|9 | 10 | 11 | 12 |



## 3. Creating a DataFrame from a Dictionary

 Another way to create a `DataFrame` is to use a dictionary. With a dictionary, the columns can have different types and their names are already set when creating the `DataFrame`. <br>

 **Example:** <br>

```python
# Creating a dictionary
dictionary = {'A': [1, 5, 9],
              'B': [2, 6, 10],
              'C': [3, 7, 11],
              'D': [4, 8, 12]}

# Instantiating a DataFrame
df = pd.DataFrame(data=dictionary,
                 index=['i_1', 'i_2', 'i_3'])
```

 This creates the same `DataFrame` as before: <br>

  | | A | B | C | D |
  | --------- | --- | ---- | ---- | ---- |
  |**i_1**| 1 |2| 3 |4|
  |**i_2**| 5 | 6 | 7 | 8 |
  |**i_3**|9 | 10 | 11 | 12 |



 #### 3.1 Exercises:
 
 The manager of a grocery store has the following food inventory: <br>

 1. **100** jars of honey with an expiration date of **08.10.2025** and a value of **2 €** each. <br>
 2. **55** packages of flour with expiration date **25.09.2024** and a price of **3 €** each. <br>
 3. **1800** bottles of wine at a price of **10 €** per unit and expiration date **15.10.2023**. <br>


> (a) Create a `DataFrame` **`df`** from a **dictionary** and display it. It should contain **for each product** the following information: <br>
>
> * Its name <br>
>
> * Its expiration date <br>
>
> * Its quantity <br>
>
> * Its price per unit <br>
>
> You can choose relevant column names and the index can be the default index (in this case we don't specify the `index` parameter). <br>

In [4]:
# Your Solution:





#### Solution:

In [31]:
dictionary = {"Product"          : ['honey', 'flour', 'wine'],
              "Expiration date"  : ['08/10/2025', '25/09/2024', '15/10/2023'],
              "Quantity"         : [100, 55, 1800],
              "Price per unit"   : [2, 3, 10]}

df = pd.DataFrame(dictionary)

df

  |  | Product | Expiration date | Quantity | Price per unit |
  |--|---------|-----------------|----------|----------------|
  |**0** | honey | 08/10/2025 | 100  | 2  |
  |**1** | flour | 25/09/2024 | 55   | 3  |
  |**2** | wine  | 15/10/2023 | 1800 | 10 |


## 4. Creating a DataFrame from a File

Normally, a `DataFrame` is created directly from a file that contains the desired data. The file format can be CSV, Excel, txt, etc.

The most common format is the CSV format, which stands for *Comma-Separated Values*. It is a table-like file where the values are separated by commas.

Here is an example:


```
A,B,C,D
1,2,3,4
5,6,7,8
9,10,11,12
```


In this format:

* **The first row contains the names of the columns**, but sometimes the column names are **not provided**.

* Each **row** corresponds to a **record**.

* The values are separated by a **delimiter**. In this example, it is `','`, but it could also be `';'`.

To import the data into a `DataFrame`, we need to use the `read_csv()` function from `pandas`:

```python
pd.read_csv(filepath_or_buffer, sep=',', header=0, index_col=0, ...)
```

The **most important arguments** of the `pd.read_csv()` function to know are:

* **`filepath_or_buffer`**: The **path** to the .csv file relative to the execution environment.

  * If the file is in the same folder as the Python environment, simply enter the name of the file.

  * This path must be entered as a **string**.

* **`sep`**: The character used in the .csv file to **separate** the different columns.

  * This argument must be provided as a **string**, for example `','` or `';'`.

* **`header`**: The **row number** that contains the **column names**.

  * For example, if the column names are in the **first** row of the `.csv` file, we specify **`header = 0`**.

  * If the column names are **not included**, we set **`header = None`**.

* **`index_col`**: The **name or number of the column** that contains the **indices** of the dataset.

  * If the entries are indexed by the first column, you must specify **`index_col = 0`**.

  * Alternatively, if the entries are indexed by a column named *`"Id"`*, you can specify **`index_col = "Id"`**.

This function returns an object of type `DataFrame` that contains all the data from the file.
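
The sketch below illustrates these arguments without needing a file on disk: `io.StringIO` wraps a string so that `read_csv` can read it like a file. The CSV contents and the column name `Id` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content: ';' as delimiter, column names in the first row
csv_text = "Id;A;B\nx;1;2\ny;3;4\n"

df_csv = pd.read_csv(io.StringIO(csv_text),
                     sep=';',         # delimiter used in the data
                     header=0,        # column names are in the first row
                     index_col='Id')  # use the 'Id' column as the index
print(df_csv)

# Hypothetical CSV content without column names: header=None
csv_no_header = "1,2\n3,4\n"
df_no_header = pd.read_csv(io.StringIO(csv_no_header), sep=',', header=None)
print(df_no_header.columns.tolist())  # columns are numbered: [0, 1]
```

With a real file, the first argument would simply be the path to the file as a string.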


#### 4.1 Exercises:

> (a) Load the data from the file **`transactions.csv`** into a `DataFrame` named **`transactions`**:
>
> * You can find the `transactions.csv` file at the following path: `"../data/transactions.csv"`.
>
> * The columns are separated by **commas**.
>
> * The names of the columns are in the **first row** of the file.
>
> * The rows of the *`DataFrame`* are indexed by the column **"transaction\_id"**, which is also the **last column**.


In [6]:
# Your Solution:




#### Solution:

In [40]:
# You can specify the index column by its name
filepath = '../data/transactions.csv'

transactions = pd.read_csv(filepath_or_buffer = filepath,          # filepath
                           sep = ',',                               # Separating character
                           header = 0,
                           index_col = "transaction_id")            # Name of column with indices


# You can also specify the index column by its position

transactions2 = pd.read_csv(filepath_or_buffer = filepath,
                           sep = ',',
                           header = 0,
                           index_col = -1) # Number of the column with indices
transactions2.tail(20)

| transaction\_id | cust\_id | tran\_date | prod\_subcat\_code | prod\_cat\_code | Qty | Rate | Tax | total\_amt | Store\_type |
| --------------- | -------- | ---------- | ------------------ | --------------- | --- | ---- | --- | ---------- | ----------- |
| **86428172953** | 987420 | 01-02-2014 | 3.0  | 2.0 | 1.0 | NaN    | 314.79 | NaN     | MBR            |
| **37659401824** | 654783 | 01-02-2014 | 9.0  | 5.0 | 2.0 | 499.0  | 104.79 | 1102.79 | TeleShop       |
| **53697421089** | 321456 | 01-02-2014 | 2.0  | 1.0 | 4.0 | 299.0  | 62.79  | 1258.79 | e-Shop         |
| **97835264170** | 789012 | 01-02-2014 | 7.0  | 3.0 | 5.0 | 649.0  | 136.29 | 3381.29 | Flagship store |
| **86042597135** | 456723 | 31-01-2014 | 10.0 | 6.0 | 3.0 | 199.0  | 41.79  | 639.79  | MobileSales    |
| **99605321740** | 890765 | 31-01-2014 | 4.0  | 2.0 | 1.0 | 899.0  | 188.79 | 1087.79 | NaN            |
| **29456321870** | 567432 | 31-01-2014 | 12.0 | 7.0 | 2.0 | 1999.0 | 419.79 | 4417.79 | TeleShop       |
| **78652109435** | 234109 | 30-01-2014 | 1.0  | 1.0 | 5.0 | 399.0  | 83.79  | 2078.79 | MBR            |
| **67045982135** | 870543 | 30-01-2014 | 5.0  | 3.0 | 4.0 | 799.0  | 167.79 | 3363.79 | e-Shop         |
| **10238567493** | 543210 | 30-01-2014 | 6.0  | 5.0 | 1.0 | 1299.0 | NaN    | 1299.0  | Flagship store |


We have loaded the file `transactions.csv` into the `DataFrame` **`transactions`**, which summarizes a history of transactions between 2011 and 2014. In the next section, we will examine this dataset. <br>


## 5. First Exploration of a Dataset with the `DataFrame` Class

The rest of this lesson introduces the most important **methods** of the `DataFrame` class that allow us to quickly analyze our dataset, such as: <br>

* Getting a brief **overview of the data** (`head` method, `columns` and `shape` attributes). <br>
* **Selecting values** in the `DataFrame` (`loc` and `iloc` methods). <br>
* Performing a quick **statistical analysis** of our data (`describe` and `value_counts` methods). <br>

**Reminder:** To apply a method to an object in Python (such as a `DataFrame`), write the method name after the object, separated by a dot and followed by parentheses. <br>
**Example:** `my_object.my_method()` <br>


## 6. Visualizing a `DataFrame`: `head` Method, `columns` and `shape` Attributes

* You can preview your DataFrame by displaying **only the first few rows** of the `DataFrame`. <br>

To do this, we use the **`head()`** method and provide the **number of rows** we want to display as an argument (default is 5). <br>

It is also possible to display the **last rows** using the **`tail()`** method, which works the same way: <br>

```python
# Display the first 10 rows of my_dataframe
my_dataframe.head(10)
```


#### 6.1 Exercises:
> (a) Display the first **20 rows** of the `transactions` `DataFrame`. <br>

In [8]:
# Your solution:





#### Solution:

In [9]:
transactions.head(20)

| transaction\_id | cust\_id | tran\_date | prod\_subcat\_code | prod\_cat\_code | Qty | Rate | Tax | total\_amt | Store\_type |
| --------------- | -------- | ---------- | ------------------ | --------------- | --- | ---- | --- | ---------- | ----------- |
| **29258453508** | 80712190438 | 28-02-2014 | 1.0  | 1.0 | 3.0 | 399.0  | 83.79  | 1280.79 | e-Shop         |
| **51750724947** | 270384      | 28-02-2014 | 5.0  | 3.0 | 2.0 | 799.0  | 167.79 | 1765.79 | e-Shop         |
| **93274880719** | 273420      | 28-02-2014 | 6.0  | 5.0 | 1.0 | 1299.0 | 272.79 | 1571.79 | TeleShop       |
| **51750724947** | 271509      | 28-02-2014 | 11.0 | 6.0 | 4.0 | 249.0  | 52.29  | 1048.29 | e-Shop         |
| **93274880719** | 273420      | 27-02-2014 | 6.0  | 5.0 | 2.0 | 599.0  | 125.79 | 1323.79 | TeleShop       |
| **76521489302** | 891652      | 27-02-2014 | 8.0  | 4.0 | 3.0 | 349.0  | 73.29  | 1120.29 | MobileSales    |
| **65432198710** | 345267      | 26-02-2014 | 3.0  | 2.0 | 1.0 | 1499.0 | 314.79 | 1813.79 | MBR            |
| **48392018472** | 652198      | 26-02-2014 | 9.0  | 5.0 | 2.0 | 499.0  | 104.79 | 1102.79 | e-Shop         |
| **91827364509** | 784512      | 25-02-2014 | 2.0  | 1.0 | 4.0 | 299.0  | 62.79  | 1258.79 | Flagship store |
| **12345678901** | 367891      | 25-02-2014 | 7.0  | 3.0 | 3.0 | 649.0  | 136.29 | 2083.29 | TeleShop       |


> (b) Display the last **10 rows** of the `transactions` `DataFrame`. <br>

In [10]:
# Your solution:





#### Solution:

In [11]:
transactions.tail(10)

| transaction\_id | cust\_id | tran\_date | prod\_subcat\_code | prod\_cat\_code | Qty | Rate | Tax | total\_amt | Store\_type |
| --------------- | -------- | ---------- | ------------------ | --------------- | --- | ---- | --- | ---------- | ----------- |
| **50369821470** | 908765 | 29-01-2014 | 11.0 | 6.0 | 3.0 | 249.0  | 52.29  | 799.29  | TeleShop       |
| **49637582104** | 678901 | 29-01-2014 | 3.0  | 2.0 | 2.0 | 1499.0 | 314.79 | 3312.79 | MobileSales    |
| **26385741950** | 345678 | 29-01-2014 | 9.0  | 5.0 | 4.0 | 499.0  | 104.79 | 2100.79 | e-Shop         |
| **45679812375** | 680246 | 06-02-2014 | 1.0  | 1.0 | 5.0 | 399.0  | 83.79  | 2078.79 | MBR            |
| **58329761480** | 12345  | 28-01-2014 | 2.0  | 1.0 | 1.0 | 299.0  | 62.79  | 361.79  | TeleShop       |
| **68953042176** | 789456 | 28-01-2014 | 7.0  | 3.0 | 5.0 | 649.0  | 136.29 | 3381.29 | Flagship store |
| **70216493785** | 123789 | 28-01-2014 | NaN  | 6.0 | 2.0 | 199.0  | 41.79  | 439.79  | e-Shop         |
| **16780395246** | 456123 | 27-01-2014 | 4.0  | 2.0 | 3.0 | 899.0  | 188.79 | 2885.79 | MobileSales    |
| **30276498531** | 789456 | 27-01-2014 | 8.0  | 4.0 | 4.0 | 349.0  | 73.29  | 1469.29 | MBR            |
| **82647193508** | 159357 | 27-01-2014 | 1.0  | 1.0 | 1.0 | 399.0  | 83.79  | 482.79  | TeleShop       |




You can retrieve the **names of the columns** of a `DataFrame` using the **`columns`** attribute. <br>

```python
# Creating a DataFrame df from a dictionary
dictionary = {'A': [1, 5, 9],
              'B': [2, 6, 10],
              'C': [3, 7, 11],
              'D': [4, 8, 12]}

df = pd.DataFrame(data=dictionary, index=['i_1', 'i_2', 'i_3'])
```

These commands create the same `DataFrame` as before: <br>

|          | A | B  | C  | D  |
| -------- | - | -- | -- | -- |
| **i\_1** | 1 | 2  | 3  | 4  |
| **i\_2** | 5 | 6  | 7  | 8  |
| **i\_3** | 9 | 10 | 11 | 12 |

```python
# Display the columns of the DataFrame df
print(df.columns)
# Output (the dtype shown may vary with the pandas version):
# Index(['A', 'B', 'C', 'D'], dtype='object')
```

The `columns` attribute returns an `Index` object that can be iterated over like a list, for example to loop through the columns of a `DataFrame`. <br>
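
As a minimal sketch of such a loop, the following code collects the maximum of each column. The small `DataFrame` and the dictionary name `column_maxima` are chosen for illustration only. <br>

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 9],
                   'B': [2, 6, 10]},
                  index=['i_1', 'i_2', 'i_3'])

# Loop over the column names and store the maximum of each column
column_maxima = {}
for column_name in df.columns:
    column_maxima[column_name] = int(df[column_name].max())

print(column_maxima)  # {'A': 9, 'B': 10}
```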

If you want to know how many transactions (rows) and how many features (columns) the dataset contains, you can use the **`shape`** attribute. It shows the **dimensions** of our `DataFrame` as a tuple in the form (number of rows, number of columns): <br>

```python
# Display the dimensions of df
print(df.shape)
# Output: (3, 4)
```

> (c) Display the **dimensions** of the `DataFrame` `transactions` as well as the **name of the 5th column**. Remember that Python uses zero-based indexing. <br>


In [12]:
# Your solution:





#### Solution:

In [13]:
print(transactions.shape)
print(transactions.columns[4])

(96, 9)
Qty



## 7. Selecting Columns from a `DataFrame`

Extracting columns from a `DataFrame` is almost identical to extracting data from a dictionary. <br>

To extract a **column** from a `DataFrame`, you simply specify the **name** of the column to extract **in square brackets**. To extract **multiple** columns, provide a **list of names** in square brackets: <br>

```python
# Display the 'cust_id' column
print(transactions['cust_id'])

# Extract the 'cust_id' and 'Qty' columns from transactions
cust_id_qty = transactions[["cust_id", "Qty"]]
```

`cust_id_qty` is a new **`DataFrame`** that contains only the `'cust_id'` and `'Qty'` columns. <br>

Displaying the first 3 rows of **`cust_id_qty`** yields: <br>

| transaction\_id | cust\_id | Qty |
| --------------- | -------- | --- |
| **80712190438** | 270351   | -5  |
| **29258453508** | 270384   | -5  |
| **51750724947** | 273420   | -2  |
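
One detail is easy to miss: a single column name returns a `pd.Series`, while a list of names returns a `DataFrame`, even if the list contains only one name. A short sketch with a made-up `DataFrame`: <br>

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'cust_id': [1, 2],
                   'Qty': [3, 4],
                   'Rate': [5.0, 6.0]})

one_column = df['Qty']               # a single name  -> Series
one_column_frame = df[['Qty']]       # a list of names -> DataFrame
two_columns = df[['cust_id', 'Qty']]

print(type(one_column).__name__)        # Series
print(type(one_column_frame).__name__)  # DataFrame
print(two_columns.shape)                # (2, 2)
```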

When preparing a dataset for later use, it's best to **separate** the **categorical** variables from the **quantitative** variables: <br>

* A *categorical* variable is one that only takes on a finite **number** of *categories*. <br>

  * The categorical variables in the `transactions` DataFrame are: `['cust_id', 'tran_date', 'prod_subcat_code', 'prod_cat_code', 'Store_type']`. <br>

* A *quantitative* variable is one that measures an amount and can take on **infinitely** many values. <br>

  * The quantitative variables in `transactions` are: `['Qty', 'Rate', 'Tax', 'total_amt']`. <br>

This distinction is important because certain basic operations, such as calculating an average, are only applicable to quantitative variables. <br>


#### 7.1 Exercises:

> (a) Store the **categorical** variables from `transactions` in a `DataFrame` named **`cat_vars`**. <br>
> (b) Store the **quantitative** variables from `transactions` in a `DataFrame` named **`num_vars`**. <br>
> (c) Display the first 5 rows of each `DataFrame`. <br>


In [14]:
# Your solution:





#### Solution:

In [15]:
# Extracting the categorical variables
cat_var_names = [ 'cust_id', 'tran_date', 'prod_subcat_code', 'prod_cat_code' , 'Store_type']
cat_vars = transactions[cat_var_names]

# Extracting the quantitative variables
num_var_names = ['Qty', 'Rate', 'Tax', 'total_amt']
num_vars = transactions[num_var_names]

# Displaying the first 5 rows of each DataFrame
print("Categorical variables:")
display(cat_vars.head())

print("Quantitative variables:")
display(num_vars.head())


Categorical variables:


| transaction\_id | cust\_id | tran\_date | prod\_subcat\_code | prod\_cat\_code | Store\_type |
| --------------- | -------- | ---------- | ------------------ | --------------- | ----------- |
| **29258453508** | 80712190438 | 28-02-2014 | 1.0  | 1.0 | e-Shop   |
| **51750724947** | 270384      | 28-02-2014 | 5.0  | 3.0 | e-Shop   |
| **93274880719** | 273420      | 28-02-2014 | 6.0  | 5.0 | TeleShop |
| **51750724947** | 271509      | 28-02-2014 | 11.0 | 6.0 | e-Shop   |
| **93274880719** | 273420      | 27-02-2014 | 6.0  | 5.0 | TeleShop |


Quantitative variables:


| transaction\_id | Qty | Rate | Tax | total\_amt |
| --------------- | --- | ---- | --- | ---------- |
| **29258453508** | 3.0 | 399.0  | 83.79  | 1280.79 |
| **51750724947** | 2.0 | 799.0  | 167.79 | 1765.79 |
| **93274880719** | 1.0 | 1299.0 | 272.79 | 1571.79 |
| **51750724947** | 4.0 | 249.0  | 52.29  | 1048.29 |
| **93274880719** | 2.0 | 599.0  | 125.79 | 1323.79 |



## 8. Selecting Rows from a `DataFrame`: `loc` and `iloc` Methods

To extract one or more **rows** from a `DataFrame`, we use **`loc`**. Strictly speaking, `loc` is an indexer rather than a regular method, which is why its arguments are provided **in square brackets** instead of parentheses. Using it is very similar to indexing a list, except that it works with index **labels**. <br>

To retrieve the row with index `i` from a `DataFrame`, just pass `i` as the argument to the `loc` method: <br>

```python
# We retrieve the row with index 80712190438 from the num_vars DataFrame
print(num_vars.loc[80712190438])
```

```
                 Rate    Tax  total_amt
transaction_id                         
80712190438    -772.0  405.3    -4265.3
80712190438     772.0  405.3     4265.3
```

To retrieve **multiple rows**, you can either: <br>

* Provide a **list of indices**, or
* Use a **slice**, by specifying the start and end labels. Unlike list slicing, the end label is **included**. Note that slicing with `loc` requires the boundary labels to be **unique** (unless the index is sorted), which is **not** the case for `transactions`. <br>

```python
# We retrieve the rows with indices 80712190438, 29258453508, and 51750724947 from the transactions DataFrame
transactions.loc[[80712190438, 29258453508, 51750724947]]
```

`loc` can also take a column or a **list of columns** as an argument to refine the data extraction: <br>

```python
# We extract the columns 'Tax' and 'total_amt' from the rows with indices 80712190438 and 29258453508
transactions.loc[[80712190438, 29258453508], ['Tax', 'total_amt']]
```

This command produces the following `DataFrame`: <br>

| transaction\_id | Tax     | total\_amt |
| --------------- | ------- | ---------- |
| **80712190438** | 405.300 | -4265.300  |
| **80712190438** | 405.300 | 4265.300   |
| **29258453508** | 785.925 | -8270.925  |
| **29258453508** | 785.925 | 8270.925   |

The **`iloc`** method is used to index a `DataFrame` **just like a NumPy array**: by specifying only the **numeric** indices of rows and columns. This allows slicing with no restrictions: <br>

```python
# Extract the first 4 rows and the first 3 columns from transactions
transactions.iloc[0:4, 0:3]
```

This code generates the following `DataFrame`: <br>

| transaction\_id | cust\_id | tran\_date | prod\_subcat\_code |
| --------------- | -------- | ---------- | ------------------ |
| **80712190438** | 270351   | 28.02.2014 | 1.0                |
| **29258453508** | 270384   | 27.02.2014 | 5.0                |
| **51750724947** | 273420   | 24.02.2014 | 6.0                |
| **93274880719** | 271509   | 24.02.2014 | 11.0               |

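If the `DataFrame` uses the default row indexing, labels and positions coincide, so `loc` and `iloc` select the same row for a single integer. They still differ when slicing: `loc` **includes** the end label, while `iloc` **excludes** the end position. <br>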
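
One caveat worth keeping in mind, sketched below on a small made-up `DataFrame`: even with the default index, slices behave differently with `loc` and `iloc`, and after filtering rows the labels no longer match the positions. <br>

```python
import pandas as pd

# Hypothetical DataFrame with the default index 0, 1, 2, 3
df = pd.DataFrame({'value': [10, 20, 30, 40]})

by_label = df.loc[0:2]      # labels 0, 1 and 2: the end label is included
by_position = df.iloc[0:2]  # positions 0 and 1: the end position is excluded
print(len(by_label), len(by_position))  # 3 2

# After filtering, the remaining rows keep their original labels
filtered = df[df['value'] > 15]
print(filtered.index.tolist())          # [1, 2, 3]

print(filtered.iloc[0]['value'])        # first remaining row by position: 20
print(filtered.loc[1]['value'])         # same row by its label: 20
# filtered.loc[0] would raise a KeyError because label 0 was filtered out
```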


## 9. Conditional Indexing of a `DataFrame`

Like with `numpy arrays`, we can use **conditional indexing** to extract rows from a `DataFrame` that meet a certain condition. <br>

In the following figure, we select the rows of the `DataFrame` `df` **where column `col 2` equals 3**. <br>

<br>

<img src="../imgs/indexation_cond_final.png" style='height:200px'> <br>

<br>

There are two ways to write conditional indexing for a `DataFrame`: <br>

```python
# We select the rows of DataFrame df where column 'col 2' equals 3.
df[df['col 2'] == 3]

df.loc[df['col 2'] == 3]
```

If we want to assign a **new value** to these entries, we must use the **`loc`** method. <br>

Indexing with the syntax `df[df['col 2'] == 3]` only returns a **copy** of these entries and does not allow access to the memory location of the data. <br>
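
A minimal sketch of such an assignment, using a made-up `DataFrame` with the column names from the figure: the condition selects the rows and the second argument of `loc` selects the column to modify. <br>

```python
import pandas as pd

# Hypothetical DataFrame for illustration
df = pd.DataFrame({'col 1': [1, 2, 3],
                   'col 2': [3, 5, 3]})

# Set 'col 1' to 0 in every row where 'col 2' equals 3
df.loc[df['col 2'] == 3, 'col 1'] = 0

print(df['col 1'].tolist())  # [0, 2, 0]
```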

The manager of the transactions listed in the **`transactions`** `DataFrame` wants access to the **identifiers** of customers who made an **online purchase** (i.e., in an `"e-Shop"`), as well as **the date of the corresponding transaction**. <br>

We have the following information about the columns of `transactions`: <br>

| Column name      | Description                                        |
|:-----------------|:---------------------------------------------------|
| `'cust_id'`      | The **identifier** of the customer                |
| `'Store_type'`   | The **type of store** where the transaction took place |
| `'tran_date'`    | The **date** of the transaction                   |

#### 9.1 Exercises:
> (a) Save the transactions that took place in a store of type `"e-Shop"` in a `DataFrame` named **`transactions_eshop`**. <br>
>
> (b) Save in another `DataFrame` named **`transactions_id_date`** the **customer identifiers** and **transaction date** from the `transactions_eshop` `DataFrame`. <br>
>
> (c) Display the first 5 rows of `transactions_id_date`. <br>

In [16]:
# Your solution:





#### Solution:

In [17]:
# Creating transactions_eshop with Conditional Indexing
transactions_eshop = transactions.loc[transactions['Store_type'] == 'e-Shop']

# Extracting the customer identification column 'cust_id' and the transaction date column 'tran_date' 
transactions_id_date = transactions_eshop[['cust_id', 'tran_date']]

# Displaying the first 5 rows of transactions_id_date
transactions_id_date.head()

| transaction\_id | cust\_id | tran\_date |
| --------------- | -------- | ---------- |
| **29258453508** | 80712190438 | 28-02-2014 |
| **51750724947** | 270384      | 28-02-2014 |
| **51750724947** | 271509      | 28-02-2014 |
| **48392018472** | 652198      | 26-02-2014 |
| **78901234567** | 123456      | 24-02-2014 |


Now the manager wants access to the transactions made by the customer with ID `567890`. <br>

> (d) Save all transactions with customer ID `567890` in a `DataFrame` named **`transactions_client_567890`**. <br>
>
> (e) A column in a `DataFrame` can be iterated over just like a list using a loop (`for value in df['column']:`). Calculate and display the total transaction amount for the customer with ID `567890` using a `for` loop over the `'total_amt'` column. <br>

In [18]:
# Your solution:





#### Solution:

In [42]:
# Extracting the transactions of the customer with customer ID 567890
transactions_client_567890 = transactions[transactions['cust_id'] == 567890]
transactions_client_567890

| transaction\_id | cust\_id | tran\_date | prod\_subcat\_code | prod\_cat\_code | Qty | Rate | Tax | total\_amt | Store\_type |
| --------------- | -------- | ---------- | ------------------ | --------------- | --- | ---- | --- | ---------- | ----------- |
| **24162562563** | 567890 | 11-02-2014 | 9.0 | 5.0 | 3.0 | 499.0 | 104.79 | 1601.79 | TeleShop |


In [19]:
# Calculating the total cost amount
total = 0

# For each amount in the 'total_amt' column
for amount in transactions_client_567890['total_amt']:
    # We add the amount
    total += amount
    
print(total)

# Or much simpler without a for loop:
print(sum(transactions_client_567890['total_amt']))

1601.79
1601.79
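
Note that `pandas` also offers a `sum()` method on columns. As a small sketch with made-up amounts, the difference from Python's built-in `sum` shows up when a value is missing: the `pandas` method skips missing values by default, whereas the built-in function propagates them. <br>

```python
import pandas as pd

# Hypothetical amounts, one of which is missing
amounts = pd.Series([100.0, None, 50.5])

print(amounts.sum())  # 150.5 (the missing value is skipped)
print(sum(amounts))   # nan   (the missing value propagates)
```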


## 10. Quick statistical analysis of data in a `DataFrame`

The **`describe()`** method of a `DataFrame` provides a summary of the **descriptive statistics** (minimum, maximum, mean, quantiles,...) of its **quantitative** variables. It is therefore a very useful tool for a first insight into the nature and distribution of these variables. <br>

To analyze **categorical** variables, you can first use the **`value_counts()`** method, which returns the **number of occurrences** for each category. It is typically applied to a single column of the `DataFrame`, which is an object of the **`pd.Series`** class. (Recent versions of `pandas` also provide `DataFrame.value_counts()`, which counts unique combinations of values across rows.) <br>
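
A minimal sketch of both methods on a small, made-up `DataFrame` (the data and variable names are illustrative only): <br>

```python
import pandas as pd

# Hypothetical data: one categorical and one quantitative column
shop = pd.DataFrame({'Store_type': ['e-Shop', 'MBR', 'e-Shop', 'TeleShop'],
                     'Qty': [1, 3, 2, 5]})

# By default, describe() only summarizes the numeric columns
summary = shop.describe()
print(summary.loc['mean', 'Qty'])  # 2.75

# value_counts() counts the occurrences of each category in a column
counts = shop['Store_type'].value_counts()
print(counts['e-Shop'])            # 2
```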

#### 10.1 Exercises:
> (a) Use the `describe` method of the `DataFrame` `transactions`. <br>
>
> (b) The quantitative variables of `transactions` are `'Qty'`, `'Rate'`, `'Tax'` and `'total_amt'`. Are the statistics of the `describe` method calculated by default **only** for the quantitative variables? <br>
>
> (c) Display the number of occurrences of each category in the `Store_type` column using the `value_counts` method. <br>

In [20]:
# Your solution:





#### Solution:

In [51]:
# (a)
print(transactions.describe())

# (b)
# Answer: no, they are calculated for all numeric columns,
# including categorical variables that are coded as numbers (e.g. cust_id)

# (c)
nmbr_of_occ = transactions['Store_type'].value_counts()
nmbr_of_occ

            cust_id  prod_subcat_code  prod_cat_code        Qty         Rate  \
count  9.600000e+01         93.000000      95.000000  92.000000    94.000000   
mean   1.682032e+09          5.688172       3.431579   2.750000   730.914894   
std    1.158825e+10          3.240298       1.866004   1.331157   480.918200   
min    1.234500e+04          1.000000       1.000000   1.000000   199.000000   
25%    2.734200e+05          3.000000       2.000000   2.000000   349.000000   
50%    5.432100e+05          6.000000       3.000000   3.000000   624.000000   
75%    7.894560e+05          8.000000       5.000000   4.000000   899.000000   
max    8.071219e+10         12.000000       7.000000   5.000000  1999.000000   

              Tax    total_amt  
count   94.000000    90.000000  
mean   153.603830  2044.131222  
std    101.727973  1561.723227  
min     41.790000   240.790000  
25%     73.290000  1087.790000  
50%    125.790000  1586.790000  
75%    188.790000  2690.415000  
max    419.790000  7809.790000  

Store_type
e-Shop            27
TeleShop          23
MBR               16
MobileSales       14
Flagship store    12
Name: count, dtype: int64

The `describe` method has calculated statistics for the variables `cust_id`, `prod_subcat_code` and `prod_cat_code`, even though these are **categorical** variables. <br>

These statistics naturally make **no sense**. The `describe` method has treated these variables as quantitative because the categories they take are numeric in nature. <br>

Therefore, you must pay attention to the results returned by the `describe` method and always keep in mind what the variables actually reflect. <br>
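
One way to avoid such misleading statistics, sketched here on a small hypothetical `DataFrame`, is to restrict `describe` to the quantitative columns, or to convert numerically coded identifiers to strings so that `describe` ignores them by default. <br>

```python
import pandas as pd

# Hypothetical data: 'cust_id' is categorical but coded as a number
data = pd.DataFrame({'cust_id': [270351, 270384, 273420],
                     'Qty': [1, 2, 5],
                     'total_amt': [100.0, 250.0, 900.0]})

# Option 1: describe only the quantitative columns
num_var_names = ['Qty', 'total_amt']
stats = data[num_var_names].describe()
print(stats.columns.tolist())  # ['Qty', 'total_amt']

# Option 2: store the identifier as text so that describe() skips it
data['cust_id'] = data['cust_id'].astype(str)
stats_after = data.describe()
print(stats_after.columns.tolist())  # ['Qty', 'total_amt']
```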

The manager wants to create a quick report on the characteristics of the `transactions` `DataFrame`: In particular, he wants to know the **average amount spent** as well as the **maximum** quantity purchased. <br>

> (d) What is the **average** total amount? We are interested in the `'total_amt'` column of `transactions`. <br>
>
> (e) What is the **maximum** purchased quantity? We are looking at the `'Qty'` column of `transactions`. <br>

In [22]:
# Your solution:





#### Solution:

In [23]:
# Apply the describe method to the transactions DataFrame
transactions.describe()

# The average amount spent is €2044.13 (mean of total_amt).
# The maximum purchased quantity is 5 (max of Qty).

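|           | cust\_id | prod\_subcat\_code | prod\_cat\_code | Qty | Rate | Tax | total\_amt |
| --------- | -------- | ------------------ | --------------- | --- | ---- | --- | ---------- |
| **count** | 96.0          | 93.0     | 95.0     | 92.0     | 94.0       | 94.0       | 90.0        |
| **mean**  | 1682032000.0  | 5.688172 | 3.431579 | 2.75     | 730.914894 | 153.60383  | 2044.131222 |
| **std**   | 11588250000.0 | 3.240298 | 1.866004 | 1.331157 | 480.9182   | 101.727973 | 1561.723227 |
| **min**   | 12345.0       | 1.0      | 1.0      | 1.0      | 199.0      | 41.79      | 240.79      |
| **25%**   | 273420.0      | 3.0      | 2.0      | 2.0      | 349.0      | 73.29      | 1087.79     |
| **50%**   | 543210.0      | 6.0      | 3.0      | 3.0      | 624.0      | 125.79     | 1586.79     |
| **75%**   | 789456.0      | 8.0      | 5.0      | 4.0      | 899.0      | 188.79     | 2690.415    |
| **max**   | 80712190000.0 | 12.0     | 7.0      | 5.0      | 1999.0     | 419.79     | 7809.79     |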


In transaction data, some transactions may have **negative** amounts. <br>

These are transactions that were cancelled and refunded to the customer. Such amounts distort the distribution of amounts and lead to **poor estimates** of the mean and quantiles of the variable `total_amt`. <br>

> (f) What is the average amount of transactions with **positive** amounts? <br>

In [24]:
# Your solution:





#### Solution:

In [25]:
transactions[transactions['total_amt'] > 0].describe()

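# The average amount spent is still €2044.13, so no transaction has a negative amount:
# this dataset contains no refunds.
# Only 90 rows remain because the 6 rows with a missing total_amt do not satisfy the condition.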



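|           | cust\_id | prod\_subcat\_code | prod\_cat\_code | Qty | Rate | Tax | total\_amt |
| --------- | -------- | ------------------ | --------------- | --- | ---- | --- | ---------- |
| **count** | 90.0          | 87.0     | 89.0     | 90.0     | 90.0       | 88.0       | 90.0        |
| **mean**  | 1794122000.0  | 5.712644 | 3.449438 | 2.777778 | 729.555556 | 152.04     | 2044.131222 |
| **std**   | 11964000000.0 | 3.291667 | 1.889031 | 1.330521 | 485.618709 | 102.322765 | 1561.723227 |
| **min**   | 12345.0       | 1.0      | 1.0      | 1.0      | 199.0      | 41.79      | 240.79      |
| **25%**   | 273420.0      | 3.0      | 2.0      | 2.0      | 349.0      | 73.29      | 1087.79     |
| **50%**   | 537426.0      | 6.0      | 3.0      | 3.0      | 599.0      | 115.29     | 1586.79     |
| **75%**   | 789345.0      | 8.5      | 5.0      | 4.0      | 899.0      | 188.79     | 2690.415    |
| **max**   | 80712190000.0 | 12.0     | 7.0      | 5.0      | 1999.0     | 419.79     | 7809.79     |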
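
The `count` row above is worth a second look: the filtered result has 90 rows instead of 96, although the mean is unchanged. A plausible explanation, consistent with the `total_amt` count of 90 in the unfiltered statistics, is that a comparison with a missing value (`NaN`) evaluates to `False`, so rows without an amount are dropped by the filter. A minimal sketch with made-up values: <br>

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, two of which are missing
amounts = pd.DataFrame({'total_amt': [120.0, np.nan, 300.0, np.nan]})

# Comparing NaN with 0 yields False, so those rows are filtered out
mask = amounts['total_amt'] > 0
print(mask.tolist())  # [True, False, True, False]

positive = amounts[mask]
print(len(amounts), len(positive))  # 4 2

# mean() skips missing values, so the mean is the same before and after
print(amounts['total_amt'].mean(), positive['total_amt'].mean())  # 210.0 210.0
```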


## Conclusion and Summary

The `DataFrame` class of the `pandas` module will be your preferred data structure when you want to explore, analyze, and process datasets and databases. <br>

In this brief introduction, you have learned: <br>

* To create a *`DataFrame`* from a *`numpy`* array and a dictionary using the **`pd.DataFrame`** constructor. <br>

* To create a *`DataFrame`* from a *`.csv`* file using the **`pd.read_csv`** function. <br>

* To display the first and last rows of a *`DataFrame`* using the **`head`** and `tail` methods. <br>

* To select one or more columns of a `DataFrame` by entering their names in square brackets like with a dictionary. <br>

* To select one or more rows of a *`DataFrame`* by specifying their indices using the **`loc`** and **`iloc`** methods. <br>

* To select the rows of a *`DataFrame`* that meet a certain condition using **conditional indexing**. <br>

* To perform a quick statistical analysis of the quantitative variables of a *`DataFrame`* using the **`describe`** method. <br>

The `transactions` dataset we used is fairly clean, but as the `count` row of `describe` shows, several of its columns already contain missing values. In practice, datasets are **rarely** complete. Therefore, in the next lesson we will see how to clean datasets with `pandas`. <br>