# Introduction to Pandas

## Introduction 

Pandas is a Python library for working with data sets. It provides functions for analysing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Some advantages of Pandas are:
- Fast and efficient for manipulating and analyzing data
- Data can be loaded from many different file formats
- Easy handling of missing data (represented as NaN)
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Data set merging and joining
- Flexible reshaping and pivoting of data sets
- Provides time-series functionality




## Getting started

If you have Python and Anaconda already installed on a system, then installation of Pandas is very easy.

Install it using this command:

In [None]:
#  conda install pandas

To load the pandas package and start working with it, import the package as:

In [1]:
import pandas as pd

Pandas provides two data structures for manipulating data: Series and DataFrames. We will focus on the second one in this introduction.

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns.
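The three components can be inspected separately. The following is a minimal sketch on a small, made-up data frame (`df_toy` and its contents are hypothetical and not part of the avocado data):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three components
df_toy = pd.DataFrame(
    {"fruit": ["avocado", "banana"], "price": [1.33, 0.25]},
    index=["a", "b"],
)

print(df_toy.columns.tolist())  # the column labels
print(df_toy.index.tolist())    # the row labels
print(df_toy.to_numpy())        # the data itself, as a NumPy array

# Each column of a DataFrame is a Series
price = df_toy["price"]
print(type(price).__name__)
```

Because the columns hold different types here (text and numbers), the NumPy array falls back to the generic `object` dtype.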

Let's load a data frame and inspect it:

In [2]:
df = pd.read_csv('data/avocado.csv')
df.head() # print the first few rows

| | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |
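A column named `Unnamed: 0` is what `read_csv` produces when a CSV file has a column with an empty header, typically a saved index. The sketch below shows one possible way to handle this, using a short in-memory CSV string (hypothetical, standing in for a real file) so that it runs without any data on disk:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file; note the empty first header
csv_text = """,Date,AveragePrice
0,2015-12-27,1.33
1,2015-12-20,1.35
"""

# Default behaviour: the blank header becomes a column called "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses that first column as the row index instead,
# and parse_dates converts the Date strings into datetime values
clean = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["Date"])
print(clean.dtypes)
```

Whether to keep or drop such a column depends on the data set. Here it is simply treated as the row index.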


In brief, the columns are:
- Date — The date of the sale
- AveragePrice — the average price of a single avocado
- Total Volume — Total number of avocados sold
- 4046 — Total number of avocados with PLU 4046 sold
- 4225 — Total number of avocados with PLU 4225 sold
- 4770 — Total number of avocados with PLU 4770 sold
- Total Bags — Total number of bags sold
- Small Bags — Number of small bags sold
- Large Bags — Number of large bags sold
- XLarge Bags — Number of extra-large bags sold
- type — conventional or organic
- year — the year of the sale
- region — the city or region of the observation

You can also create a data frame 'from scratch':

In [3]:
df_passengers = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris",
                 "Allen, Mr. William Henry",
                 "Bonnell, Miss. Elizabeth"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

df_passengers

| | Name | Age | Sex |
|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | 22 | male |
| 1 | Allen, Mr. William Henry | 35 | male |
| 2 | Bonnell, Miss. Elizabeth | 58 | female |


## Exercise
Different attributes and methods exist to inspect the dataframe and return statistical information. Try `.columns`, `.index` (there is no `.row`!), `.info()` and `.describe()` on the `df` dataframe to answer the following questions:
- How many rows and columns are in this data frame?
- What type of data does the `year` column contain?
- What period of time does this table cover?
- What are the maximum and minimum average prices?
- What is the mean of the number of bags sold per sale?

In [None]:
# Your code here

## Indexing, selecting

You can select individual columns or groups of columns:

In [None]:
# One individual column
df_passengers.Name
# or using the indexing ([]) operator:
df_passengers["Name"]

# Groups of columns can be selected by passing a list:
df_passengers[["Name", "Age"]]


Note that, for practical reasons, only a subset of a large dataframe appears on your screen (the first five and last five rows).
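To control which rows are shown, `head()` and `tail()` can be called explicitly. A small sketch on a made-up 100-row frame (`big` is a hypothetical name):

```python
import pandas as pd

# Hypothetical frame with 100 rows, large enough to be truncated on display
big = pd.DataFrame({"n": range(100)})

first = big.head(3)  # first three rows
last = big.tail(3)   # last three rows

print(first)
print(last)
```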

A specific value can be called as:

In [None]:
df_passengers.Name[1]
# or
df_passengers["Name"][1]
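The two forms above first select a column and then a row. As an alternative sketch, `loc` and `at` address the row label and column name in a single step. The frame `people` below is a hypothetical stand-in for `df_passengers`:

```python
import pandas as pd

# Hypothetical stand-in for df_passengers
people = pd.DataFrame(
    {"Name": ["Braund", "Allen", "Bonnell"], "Age": [22, 35, 58]}
)

# Row label 1, column "Name", in one indexing step
name_loc = people.loc[1, "Name"]

# `at` is a scalar-only variant of `loc`
name_at = people.at[1, "Name"]

print(name_loc, name_at)
```

Single-step access is usually the safer choice when assigning a value, since assigning through chained indexing may act on a temporary copy.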

It is also possible to modify specific columns or create new columns:

In [None]:
df_passengers['Year of birth'] = 2023 - df_passengers['Age']
df_passengers
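Assigning to a column name changes the dataframe in place. When the original should stay untouched, `assign()` returns a modified copy instead. A minimal sketch with hypothetical data and column names:

```python
import pandas as pd

# Hypothetical data, standing in for df_passengers
people = pd.DataFrame({"Name": ["Braund", "Allen"], "Age": [22, 35]})

# Overwrite an existing column in place
people["Name"] = people["Name"].str.upper()

# assign() returns a new DataFrame and leaves `people` unchanged
with_months = people.assign(AgeInMonths=people["Age"] * 12)

print(people.columns.tolist())
print(with_months.columns.tolist())
```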

Individual rows can be selected based on their integer position, using `iloc`:

In [None]:
df_passengers.iloc[0]
df_passengers.iloc[[0, 2]]
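With the default index (0, 1, 2, ...), positions and labels coincide, which hides the difference between `iloc` (position) and `loc` (label). The sketch below uses a hypothetical frame with text labels to make the distinction visible:

```python
import pandas as pd

# Hypothetical frame whose row labels are not 0, 1, 2
people = pd.DataFrame(
    {"Age": [22, 35, 58]},
    index=["braund", "allen", "bonnell"],
)

first_by_position = people.iloc[0]          # first row, whatever its label
by_label = people.loc["bonnell"]            # row whose label is "bonnell"
first_two = people.iloc[0:2]                # positions 0 and 1 (end excluded)
label_slice = people.loc["braund":"allen"]  # label slices include both ends

print(first_two)
print(label_slice)
```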

## Statistical information

The summary functions mentioned above can also be applied to individual columns. Try the following commands:

In [None]:
print(df.year.describe())

print(df.AveragePrice.mean())

print(df.region.unique())
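For categorical columns such as `region`, counting values is often as useful as listing them. A short sketch on a made-up Series (the values are hypothetical):

```python
import pandas as pd

# Hypothetical region values
regions = pd.Series(["Albany", "Atlanta", "Albany", "Boston"], name="region")

print(regions.unique())        # distinct values, in order of appearance
print(regions.nunique())       # number of distinct values
print(regions.value_counts())  # how often each value occurs
```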

To compute the correlation coefficients between pairs of columns:

In [None]:
subdf = df[['AveragePrice', 'Total Volume', 'Total Bags']]  # select a subset of the initial df
subdf.corr()

As expected, there is a tight correlation between the total number of avocados sold (`Total Volume`) and the number of bags sold (`Total Bags`). The average price decreases slightly with an increasing volume.
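To see how to read a correlation matrix, it can help to try `corr()` on data where the answer is known in advance. The numbers below are invented for illustration and are not taken from the avocado data:

```python
import pandas as pd

# Hypothetical values chosen so the relationships are obvious
toy = pd.DataFrame(
    {
        "volume": [100, 200, 300, 400],
        "bags": [10, 20, 30, 40],       # exactly proportional to volume
        "price": [1.6, 1.4, 1.3, 1.1],  # falls as volume rises
    }
)

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))
```

A value close to 1 indicates that two columns rise together, a value close to -1 that one falls as the other rises, and the diagonal is always 1.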

## Conditional selection

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

To select all the men in the above list:

In [None]:
df_passengers.Sex == 'male'

This operation produces a Series of True/False booleans based on the `Sex` value of each record. This result can then be used inside the indexing operator `[]` (or `loc`) to select the matching rows:

In [None]:
df_passengers[df_passengers.Sex == 'male']
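Several conditions can be combined into one mask. The following sketch rebuilds a shortened, hypothetical version of the passenger table so that it runs on its own:

```python
import pandas as pd

# Hypothetical, shortened version of the passenger table
people = pd.DataFrame(
    {
        "Name": ["Braund", "Allen", "Bonnell"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Combine conditions with & (and) or | (or); each needs its own parentheses
men_over_30 = people[(people["Sex"] == "male") & (people["Age"] > 30)]

# loc accepts the same kind of mask and can select columns at the same time
names_of_women = people.loc[people["Sex"] == "female", "Name"]

# isin() tests membership in a list of values
selected_ages = people[people["Age"].isin([22, 58])]

print(men_over_30)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.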

## Exercise

Add a column to the avocado dataframe indicating the total price for the sale.

Determine how many avocado sales made in the region of Atlanta reached a total price above $700000.

In [None]:
# Your code here

## Grouping and sorting

The `groupby` operation allows you to group the data by the values in a specific column.

We can then find, for example, the minimum average price per region, or per region and per year:

In [None]:
df.groupby('region').AveragePrice.min()

In [None]:
df.groupby(['region', 'year']).AveragePrice.min()
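More than one summary can be computed per group with `agg()`. This sketch uses a small invented table (`sales` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame(
    {
        "region": ["Albany", "Albany", "Atlanta", "Atlanta"],
        "price": [1.33, 0.93, 1.10, 1.50],
    }
)

# One row per region, one column per summary statistic
summary = sales.groupby("region")["price"].agg(["min", "max", "mean"])
print(summary)
```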

The `sort_values()` method can be used to sort the data in many ways:

In [None]:
df.sort_values(by='AveragePrice')

`sort_values()` defaults to an ascending sort, where the lowest values go first. However, sometimes we would prefer a descending order:

In [None]:
df.sort_values(by='AveragePrice', ascending=False)

Finally, a dataframe can be sorted by more than one column at a time:

In [None]:
df.sort_values(by=['type','AveragePrice'])
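When sorting by several columns, `ascending` also accepts a list, so each column can have its own direction. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame(
    {
        "type": ["organic", "conventional", "organic", "conventional"],
        "price": [1.80, 1.10, 1.50, 0.90],
    }
)

# type in alphabetical order, and within each type the most expensive first
ordered = sales.sort_values(by=["type", "price"], ascending=[True, False])

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting
ordered = ordered.reset_index(drop=True)
print(ordered)
```

Note that `sort_values()` returns a new dataframe and the original keeps its order unless the result is assigned back.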

## Exercise

Compare the mean average price for organic and conventional avocados.

In [None]:
# Your code here

Create a new dataframe containing only the date of the sale, the average price and the region, sorted by date (from the most recent to the oldest) and by total price (from the cheapest to the most expensive).

In [None]:
# Your code here