# Introduction to Pandas

## Introduction 

Pandas is a Python library used for working with data sets. It has functions for analysing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Some advantages of Pandas are:
- Fast and efficient for manipulating and analyzing data
- Data can be loaded from many different file formats
- Easy handling of missing data (represented as NaN)
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Data set merging and joining
- Flexible reshaping and pivoting of data sets
- Provides time-series functionality
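
As a small illustration of the missing-data point above, here is a minimal sketch on made-up values (the Series `s_demo` is hypothetical and not part of the avocado data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with one missing value
s_demo = pd.Series([1.0, np.nan, 3.0])

n_missing = s_demo.isna().sum()      # count the missing entries
mean_skipping_nan = s_demo.mean()    # NaN values are skipped by default
filled = s_demo.fillna(0.0)          # replace missing values with 0.0
dropped = s_demo.dropna()            # or drop them altogether

print(n_missing, mean_skipping_nan)
print(filled)
print(dropped)
```

Whether to fill or drop missing values depends on the analysis; both methods return a new object and leave `s_demo` unchanged.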




## Getting started

If you have Python and Anaconda already installed on a system, then installation of Pandas is very easy.

Install it using one of these commands, depending on your package manager: `conda install pandas` or `pip install pandas`.

To load the pandas package and start working with it, import the package as:

In [None]:
import pandas as pd


Pandas provides two data structures for manipulating data: Series and DataFrames. We will focus on the second one in this introduction.

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns.
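
The three components can be inspected directly. The following is a minimal sketch on a hypothetical toy table (`df_demo` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame(
    {"fruit": ["avocado", "banana", "cherry"], "price": [1.5, 0.25, 4.0]},
    index=["a", "b", "c"],
)

print(df_demo.index)       # the row labels
print(df_demo.columns)     # the column labels
print(df_demo.to_numpy())  # the data as a NumPy array
print(df_demo.dtypes)      # each column keeps its own data type
```

Each column of a DataFrame is itself a Series, which is why columns can hold different data types.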

Let's load a data frame and inspect it:

In [None]:
import os
import json

#!pip install kaggle

!kaggle datasets list -s avocado --sort-by votes 

 

In [None]:

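# Read the Kaggle API credentials from the downloaded kaggle.json file
with open('/Users/aromanelli/Downloads/kaggle.json') as f:
    data = json.load(f)  # the JSON object is parsed into a dictionary

# List the fields stored in the file (the names only, not the secret values)
for field in data:
    print(field)

username = data['username']
key = data['key']

# Uncomment to expose the credentials to the Kaggle command-line tool
#os.environ["KAGGLE_USERNAME"] = username
#os.environ["KAGGLE_KEY"] = key

# Uncomment to download and unzip the avocado dataset
#!kaggle datasets download -q neuromusic/avocado-prices
#!unzip avocado-prices.zip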



In [None]:
avocado = pd.read_csv('/Users/aromanelli/Downloads/avocado.csv')
avocado.head() # print the first few rows

In brief, the columns are:
- Date — The date of the sale
- AveragePrice — the average price of a single avocado
- Total Volume — Total number of avocados sold
- 4046 — Total number of avocados with PLU 4046 sold
- 4225 — Total number of avocados with PLU 4225 sold
- 4770 — Total number of avocados with PLU 4770 sold
- Total Bags — Total number of bags sold
- Small Bags — Number of small bags sold
- Large Bags — Number of large bags sold
- XLarge Bags — Number of extra-large bags sold
- type — conventional or organic
- year — the year of the sale
- region — the city or region of the observation

You can also create a data frame 'from scratch':

In [None]:
df_passengers = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris",
                 "Allen, Mr. William Henry",
                 "Bonnell, Miss. Elizabeth"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

df_passengers

## Exercise
Different methods exist to inspect the dataframe and return statistical information. Try `.columns`, `.index` (the row labels; there is no `.rows` attribute), `.info()` and `.describe()` on the `avocado` dataframe to answer the following questions:
- How many rows and columns are in this data frame?
- What type of data does the `year` column contain?
- What period of time does the table cover?
- What are the maximum and minimum average prices?
- What is the mean of the number of bags sold per sale?

In [None]:
# Your code here

## Indexing, selecting

You can select an individual column or a group of columns:

In [None]:
# One individual column
df_passengers.Name
# or using the indexing ([]) operator:
df_passengers["Name"]

# Groups of columns can be selected by passing a list of names:
#df_passengers[["Name", "Age"]]


Note that, for practical reasons, only a subset of a large dataframe appears on your screen (the first five and the last five rows).

A specific value can be accessed as:

In [None]:
df_passengers.Name[1]
# or
df_passengers["Name"][1]

It is also possible to modify specific columns or create new columns:

In [None]:
df_passengers['Year of birth'] = 2023 - df_passengers['Age']
df_passengers

Individual rows can be selected by their integer position, using `iloc`:

In [None]:
df_passengers.iloc[0]
df_passengers.iloc[[0, 2]]
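
`iloc` selects by integer position, while its counterpart `loc` selects by label. The difference only becomes visible when the index is not the default 0, 1, 2, ... range, so the sketch below uses a hypothetical dataframe with string labels (`df_demo`, its names and labels are made up):

```python
import pandas as pd

# Hypothetical toy data with a non-default index
df_demo = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid"], "Age": [22, 35, 58]},
    index=["p1", "p2", "p3"],
)

first_by_position = df_demo.iloc[0]     # first row, whatever its label is
first_by_label = df_demo.loc["p1"]      # the row labelled "p1"
age_of_bob = df_demo.loc["p2", "Age"]   # one value: row label, column label
subset = df_demo.iloc[[0, 2]]           # first and third rows

print(first_by_position)
print(age_of_bob)
print(subset)
```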

## Statistical information

The summary functions mentioned above can also be applied to individual columns. Try the following commands:

In [None]:
print(avocado.year.describe())

print(avocado.AveragePrice.mean())

print(avocado.region.unique())

To compute the correlation coefficients between pairs of columns:

In [None]:
subdf = avocado[['AveragePrice', 'Total Volume', 'Total Bags']]  # selecting a subset of the initial dataframe
subdf.corr()

As expected, there is a tight correlation between the total number of avocados sold (`Total Volume`) and the number of bags sold (`Total Bags`). The average price decreases slightly with an increasing volume.
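
To see how the output of `.corr()` reads, here is a minimal sketch on hypothetical, perfectly linear data (`df_demo` is made up). By default `.corr()` computes the Pearson correlation coefficient, which ranges from -1 to 1:

```python
import pandas as pd

# Hypothetical toy data: y grows with x, z shrinks with x
df_demo = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
)

corr = df_demo.corr()  # symmetric matrix with ones on the diagonal
print(corr)
```

Here `x` and `y` are perfectly correlated and `x` and `z` perfectly anti-correlated; real data such as the avocado table falls somewhere in between.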

## Conditional selection

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

To select all the men in the passengers dataframe created above:

In [None]:
df_passengers.Sex == 'male'

This operation produces a Series of True/False booleans based on the `Sex` value of each record. This result can then be used inside the indexing operator `[]` (or inside `loc`) to select the relevant data:

In [None]:
df_passengers[df_passengers.Sex == 'male']
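
Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on a hypothetical copy of the passengers table (the shortened names are made up):

```python
import pandas as pd

# Hypothetical toy data mirroring the passengers table
df_demo = pd.DataFrame(
    {
        "Name": ["Braund", "Allen", "Bonnell"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Both conditions must hold
men_over_30 = df_demo.loc[(df_demo["Sex"] == "male") & (df_demo["Age"] > 30)]

# At least one condition must hold
young_or_female = df_demo.loc[(df_demo["Age"] < 30) | (df_demo["Sex"] == "female")]

# loc also accepts a column selection after the row condition
names_20_to_40 = df_demo.loc[df_demo["Age"].between(20, 40), "Name"]

print(men_over_30)
print(young_or_female)
print(names_20_to_40)
```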

## Exercise

Add a column to the avocado dataframe indicating the total price for the sale.

Determine how many avocado sales made in the region of Atlanta have reached a price above $700000.

In [None]:
# Your code here

## Grouping and sorting

The `groupby` operation allows you to group the data by the values in a specific column.

We can then find for example the minimum average price per region, or per region and per year:

In [None]:
avocado.groupby('region').AveragePrice.min()

In [None]:
avocado.groupby(['region', 'year']).AveragePrice.min()
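
Several statistics can be computed per group in one call with `.agg()`. The sketch below uses a hypothetical toy table (`sales` and its values are made up) rather than the avocado data:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
sales = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "year": [2020, 2021, 2020, 2021],
        "price": [1.0, 1.5, 2.0, 3.0],
    }
)

# One row per region, one column per statistic
summary = sales.groupby("region")["price"].agg(["min", "mean", "max"])

# Grouping by two columns gives a result with a two-level index
per_region_year = sales.groupby(["region", "year"])["price"].min()

print(summary)
print(per_region_year.loc[("South", 2021)])
```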

The `sort_values()` method can be used to sort the data in many ways:

In [None]:
avocado.sort_values(by='AveragePrice')

`sort_values()` defaults to an ascending sort, where the lowest values go first. However, sometimes we would prefer a descending order:

In [None]:
avocado.sort_values(by='AveragePrice', ascending=False)

Finally, a dataframe can be sorted by more than one column at a time:

In [None]:
avocado.sort_values(by=['type','AveragePrice'])
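
When sorting by several columns, `ascending` also accepts a list with one entry per column, so each column can have its own direction. A minimal sketch on hypothetical data (`df_demo` is made up):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame(
    {
        "type": ["organic", "conventional", "organic", "conventional"],
        "price": [2.0, 1.0, 3.0, 1.5],
    }
)

# type in ascending (alphabetical) order, price in descending order within each type
sorted_demo = df_demo.sort_values(by=["type", "price"], ascending=[True, False])
print(sorted_demo)

# The original row labels are kept; reset them if a fresh 0..n-1 index is wanted
print(sorted_demo.reset_index(drop=True))
```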

## Exercise

Compare the mean average price for organic and conventional avocados.

In [None]:
# Your code here

Create a new dataframe containing only the date of the sale, the average price and the region, sorted by date (from the most recent to the oldest) and by total price (from the cheapest to the most expensive).

In [None]:
# Your code here

## Concatenation and Merging

**pandas** provides various methods for combining Series or DataFrame objects with various kinds of set logic.

The ``concat()`` function performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes on the other axes.


In [None]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)


df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)


df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)


frames = [df1, df2, df3]

result = pd.concat(frames, keys=["x", "y", "z"])
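
The `keys` argument adds an outer index level that records which input each row came from. A minimal sketch with two smaller, hypothetical frames (`a` and `b` are made up) shows how that level can be used afterwards:

```python
import pandas as pd

# Hypothetical toy frames, used only for illustration
a = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
b = pd.DataFrame({"A": ["A2", "A3"]}, index=[2, 3])

stacked = pd.concat([a, b], keys=["x", "y"])

print(stacked)           # the index now has two levels: key, original label
print(stacked.loc["y"])  # recover the rows that came from the second frame
```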


How should the other axes be handled when the inputs do not share the same labels?

This can be done in the following two ways:

- Take the union of them all, `join='outer'`. This is the default option as it results in zero information loss

- Take the intersection, `join='inner'`

In [None]:
df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)
result = pd.concat([df1, df4], axis=1, join="inner")
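
The difference between the two options is easiest to see side by side. The sketch below uses two hypothetical frames (`left_demo` and `right_demo` are made up) whose indexes only partly overlap:

```python
import pandas as pd

# Hypothetical toy frames with partly overlapping row labels
left_demo = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
right_demo = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=[1, 2, 3])

# Union of the row labels: missing cells are filled with NaN
outer = pd.concat([left_demo, right_demo], axis=1)  # join="outer" is the default

# Intersection of the row labels: only rows present in both inputs
inner = pd.concat([left_demo, right_demo], axis=1, join="inner")

print(outer)
print(inner)
```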


**pandas** provides a single function, `merge()`, as the entry point for all standard database join operations between DataFrame objects:

![Alt text](image-2.png)

In [None]:
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)


right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

result = pd.merge(left, right, how="outer", on=["key1", "key2"])
result
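
The `how` argument selects the type of join. A minimal sketch on two hypothetical single-key tables (`left_demo` and `right_demo` are made up) compares the row counts; `indicator=True` adds a `_merge` column telling where each row came from:

```python
import pandas as pd

# Hypothetical toy tables sharing a "key" column
left_demo = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right_demo = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

inner = pd.merge(left_demo, right_demo, how="inner", on="key")     # keys in both
left_join = pd.merge(left_demo, right_demo, how="left", on="key")  # all left keys
outer = pd.merge(left_demo, right_demo, how="outer", on="key", indicator=True)

print(len(inner), len(left_join), len(outer))
print(outer)
```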

## Exercise

Download the following dataset, using the Kaggle API:

Name: data science for good kiva crowdfunding
Owner : kiva

Do some exploratory analysis on the loans.csv dataset:
- print the name of the columns and data types using built-in pandas functions
- extract some stats on the loans (counts, average loan, total amount funded), grouped by sector, country and both. 
- Add a column for the amount of each loan funded relative to the total sum
- Sort the values based on the total amount funded.

Use the merging method to combine different .csv files:

- do a join with the theme_ids.csv file, on the 'id' key and save the result of the join as a new dataframe
- join the result of the previous point with the theme_by_region.csv file, this time using multiple keys: region, country, Loan Theme ID, Partner ID



## Windowing

**pandas** contains a set of methods for performing windowing operations - an operation that performs an aggregation over a sliding partition of values. The method functions similarly to the **groupby**  in that Series and DataFrame call the windowing method with necessary parameters and then subsequently call the aggregation function.

**pandas** supports 4 types of windowing operations:

- Rolling window: Generic fixed or variable sliding window over the values.
- Weighted window: Weighted, non-rectangular window 
- Expanding window: Accumulating window over the values.
- Exponentially Weighted window: Accumulating and exponentially weighted window over the values.

In [None]:
s = pd.Series(range(5), index=pd.date_range('2023-01-01', periods=5, freq='1D'))
s
s.rolling(window='2D').sum()
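
Three of the window types listed above can be compared on the same data. This is a minimal sketch on a hypothetical Series (`s_demo` is made up); the weighted window is left out because it relies on SciPy:

```python
import pandas as pd

# Hypothetical toy Series with one value per day
s_demo = pd.Series(
    [0.0, 1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

# Rolling window of 2 observations: the first result is NaN (window not full)
rolling_sum = s_demo.rolling(window=2).sum()

# Rolling window of 2 days: a time-based window accepts partial windows
rolling_2d = s_demo.rolling(window="2D").sum()

# Expanding window: everything from the start up to the current row
expanding_sum = s_demo.expanding().sum()

# Exponentially weighted window: recent values weigh more
ewm_mean = s_demo.ewm(alpha=0.5).mean()

print(pd.DataFrame({
    "rolling_2": rolling_sum,
    "rolling_2D": rolling_2d,
    "expanding": expanding_sum,
    "ewm": ewm_mean,
}))
```

The only difference between the two rolling columns is the first row: a window given as a number of observations waits until it is full, whereas a window given as a time offset starts producing values immediately.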



## Exercise 
Download this dataset:

`akashram/indian-summer-over-the-years`

After an Exploratory Data Analysis (EDA), use the window APIs to calculate and plot a median trend of the temperature measurements in India (see 'Median Filter' on Wikipedia).

The window method can be chained after a groupby. Make use of the chaining, to plot a median trend of the temperature per city. 

For the EDA: Pandas supports some APIs for plotting using Matplotlib. Try using `df.hist()` or `df.plot()`, along with additional parameters, to see how to plot the columns of a DataFrame properly and efficiently.