# Pandas for Data Analysis

[**Pandas**](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

We will first introduce some core aspects of pandas using toy data, and then analyse a real data set. To begin, we import the pandas package; by convention we give it the shorthand name `pd` using `as`. When we want to use the package, we can then type `pd.` instead of `pandas.`.

#### Useful links
* [Data Wrangling cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
* [Python For Data Science cheat sheet](https://www.utc.fr/~jlaforet/Suppl/python-cheatsheets.pdf)

In [2]:
import pandas as pd

### Creating and Reading Data

There are two core objects in pandas: **Series** and **DataFrame**.

[**Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) is a one-dimensional (1d) *list* of values. It has a corresponding *index* (one label per value) and, optionally, a *name*.

In [3]:
pd.Series([3780, 4120, 4750], index=[2020,2021,2022], name='sales')

2020    3780
2021    4120
2022    4750
Name: sales, dtype: int64

[**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) is a two-dimensional (2d) *table* of values. Each row is a "record" with its own *index* label, and each column is a **Series** with its own (column) *name*.

In [4]:
df = pd.DataFrame({'date': pd.date_range('2022-05-31', periods=5, freq='ME'), # freq='ME' means month-end frequency (pandas >= 2.2; older versions use 'M')
                   'sales': [300.12, 313.28, 330.64, 347.59, 352.11],
                   'department': 'domestic'})
df

Unnamed: 0,date,sales,department
0,2022-05-31,300.12,domestic
1,2022-06-30,313.28,domestic
2,2022-07-31,330.64,domestic
3,2022-08-31,347.59,domestic
4,2022-09-30,352.11,domestic


**Checking the DataFrame index**

- The index is how Pandas labels and organizes the rows in your DataFrame.
- Knowing the index is important because it tells you how you can access, align, or join your data.


In [5]:
df.index

RangeIndex(start=0, stop=5, step=1)

**Setting a column as the index**
- Makes it easier to work with time series data, since dates are now row labels.

In [6]:
df.set_index('date', inplace=True)

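**We can also reset the index**
- `reset_index()` returns a new DataFrame with `date` restored as a column; `df` itself is unchanged unless we assign the result or pass `inplace=True`.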

In [7]:
df.reset_index()

Unnamed: 0,date,sales,department
0,2022-05-31,300.12,domestic
1,2022-06-30,313.28,domestic
2,2022-07-31,330.64,domestic
3,2022-08-31,347.59,domestic
4,2022-09-30,352.11,domestic


**Checking the DataFrame columns**
- Lists all column labels in the DataFrame.
- Useful for quickly checking the structure of your dataset.
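
As a minimal sketch of this check, the snippet below reads the column labels through the `.columns` attribute. The frame `toy` and its values are made up for illustration and are not part of the data used elsewhere in this notebook.

```python
import pandas as pd

# Small illustrative frame; the name `toy` and its values are assumptions
toy = pd.DataFrame({'sales': [300.12, 313.28],
                    'department': ['domestic', 'domestic']})

# .columns returns an Index object holding the column labels
print(toy.columns)

# Convert to a plain Python list when that is more convenient
column_names = toy.columns.tolist()
print(column_names)
```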

### Exercise

1. Create a DataFrame called `sales` that matches the table below

| week       | electronics_sales | furniture_sales |
|------------|-------------------|-----------------|
| 2022-06-05 | 120               | 85              |
| 2022-06-12 | 135               | 90              |
| 2022-06-19 | 128               | 88              |
| 2022-06-26 | 150               | 95              |

2. Display the sales dataframe
3. Set the `week` column as index

In [8]:
import pandas as pd

sales = pd.DataFrame({'week': ['2022-06-05', '2022-06-12', '2022-06-19', '2022-06-26'],
                      'electronics_sales': [120, 135, 128, 150],
                      'furniture_sales': [85, 90, 88, 95]})
print(sales)  # step 2: display the DataFrame

# Step 3: set_index returns a new DataFrame, so assign the result back
sales = sales.set_index('week')


         week  electronics_sales  furniture_sales
0  2022-06-05                120               85
1  2022-06-12                135               90
2  2022-06-19                128               88
3  2022-06-26                150               95


More often, DataFrames are created from data files, such as **CSV (comma-separated values)** files, using the [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.

**Let's import our first dataframe**

In [9]:
reviews = pd.read_csv('../Data/wine_reviews.csv', index_col=0)
reviews.reset_index(drop=True, inplace=True)
reviews


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Australia,"Possibly a little sweet, this is a soft, easyg...",,83,5.0,Australia Other,South Eastern Australia,,Joe Czerwinski,@JoeCz,Banrock Station 2006 Chardonnay (South Eastern...,Chardonnay,Banrock Station
1,France,"A soft, almost off dry wine that is full in th...",Réserve,85,12.0,Rhône Valley,Côtes du Rhône,,Roger Voss,@vossroger,Cellier des Dauphins 2015 Réserve Rosé (Côtes ...,Rosé,Cellier des Dauphins
2,Spain,Generic white-fruit aromas of peach and apple ...,Estate Grown & Bottled,86,9.0,Northern Spain,Rueda,,Michael Schachner,@wineschach,Esperanza 2013 Estate Grown & Bottled Verdejo-...,Verdejo-Viura,Esperanza
3,US,This is the winery's best Nebula in years. Whi...,Nebula,87,29.0,California,Paso Robles,Central Coast,,,Midnight 2010 Nebula Cabernet Sauvignon (Paso ...,Cabernet Sauvignon,Midnight
4,US,This is a very rich Pinot whose primary virtue...,Wiley Vineyard,88,40.0,California,Anderson Valley,,,,Harrington 2006 Wiley Vineyard Pinot Noir (And...,Pinot Noir,Harrington
...,...,...,...,...,...,...,...,...,...,...,...,...,...
58482,US,A solid effort from a dependable winery that u...,Winemaker's Reserve,88,35.0,California,Sonoma County,Sonoma,,,Château Souverain 1996 Winemaker's Reserve Cab...,Cabernet Sauvignon,Château Souverain
58483,Greece,"Crushed thyme, pine resin and lemon start this...",Retsina of Attica,86,9.0,Attica,,,Susan Kostrzewa,@suskostrzewa,Kourtaki NV Retsina of Attica Savatiano (Attica),Savatiano,Kourtaki
58484,Italy,"Made from Negroamaro, this opens with aromas o...",,87,15.0,Southern Italy,Salento,,Kerin O’Keefe,@kerinokeefe,Masseria Altemura 2016 Rosato (Salento),Rosato,Masseria Altemura
58485,US,"This big, bold wine has the taste profile of a...",Estate Mae's Block Ravazzi Vineyard,88,32.0,California,Mendocino,,Jim Gordon,@gordone_cellars,Jaxon Keys 2013 Estate Mae's Block Ravazzi Vin...,Zinfandel,Jaxon Keys


### Viewing, Selecting, Assigning & Missing Data

In [11]:
reviews.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58487 entries, 0 to 58486
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                58459 non-null  object 
 1   description            58487 non-null  object 
 2   designation            41548 non-null  object 
 3   points                 58487 non-null  int64  
 4   price                  54404 non-null  float64
 5   province               58459 non-null  object 
 6   region_1               48949 non-null  object 
 7   region_2               22686 non-null  object 
 8   taster_name            46635 non-null  object 
 9   taster_twitter_handle  44376 non-null  object 
 10  title                  58487 non-null  object 
 11  variety                58487 non-null  object 
 12  winery                 58487 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 5.8+ MB


In [12]:
reviews.shape

(58487, 13)

### Selecting Data

Also called **indexing**, selecting data is one of the most common operations in pandas. We discuss four ways of selecting data from a DataFrame:
1. Selecting one **column** (as a Series)
2. Selecting by **label**
3. Selecting by **position**
4. Selecting by **conditions**

We will practice with the wine review DataFrame.

In [13]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Australia,"Possibly a little sweet, this is a soft, easyg...",,83,5.0,Australia Other,South Eastern Australia,,Joe Czerwinski,@JoeCz,Banrock Station 2006 Chardonnay (South Eastern...,Chardonnay,Banrock Station
1,France,"A soft, almost off dry wine that is full in th...",Réserve,85,12.0,Rhône Valley,Côtes du Rhône,,Roger Voss,@vossroger,Cellier des Dauphins 2015 Réserve Rosé (Côtes ...,Rosé,Cellier des Dauphins
2,Spain,Generic white-fruit aromas of peach and apple ...,Estate Grown & Bottled,86,9.0,Northern Spain,Rueda,,Michael Schachner,@wineschach,Esperanza 2013 Estate Grown & Bottled Verdejo-...,Verdejo-Viura,Esperanza
3,US,This is the winery's best Nebula in years. Whi...,Nebula,87,29.0,California,Paso Robles,Central Coast,,,Midnight 2010 Nebula Cabernet Sauvignon (Paso ...,Cabernet Sauvignon,Midnight
4,US,This is a very rich Pinot whose primary virtue...,Wiley Vineyard,88,40.0,California,Anderson Valley,,,,Harrington 2006 Wiley Vineyard Pinot Noir (And...,Pinot Noir,Harrington


**1. Selecting a column**

In [14]:
reviews.country

0        Australia
1           France
2            Spain
3               US
4               US
           ...    
58482           US
58483       Greece
58484        Italy
58485           US
58486        Spain
Name: country, Length: 58487, dtype: object

**2. Selecting by label**

Here "label" means the "row names" `index` and the "column names" `columns`.

Use `loc[]` to access part of the DataFrame by row and column **labels**. Note the `[]` instead of `()`.

In [16]:
reviews.loc[1, "country"]

'France'

We can use `:` inside `.loc[]` to select all rows for one or more columns:

In [17]:
reviews.loc[:, ['country', 'province']]


Unnamed: 0,country,province
0,Australia,Australia Other
1,France,Rhône Valley
2,Spain,Northern Spain
3,US,California
4,US,California
...,...,...
58482,US,California
58483,Greece,Attica
58484,Italy,Southern Italy
58485,US,California


We can also rearrange `.loc[]` to obtain all columns for a specific range of rows.

Leaving the end of the `:` slice empty inside `.loc[]` selects everything from a specific row until the end of the DataFrame.
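
The two row-slicing patterns can be sketched as follows. The frame `toy` is a made-up stand-in for `reviews`, so the labels and values are assumptions chosen only to keep the example self-contained.

```python
import pandas as pd

# Illustrative stand-in for `reviews`; values are made up
toy = pd.DataFrame({'country': ['Australia', 'France', 'Spain', 'US', 'US'],
                    'points': [83, 85, 86, 87, 88]})

# All columns for the rows labelled 1 through 3.
# Unlike ordinary Python slices, .loc includes the end label.
middle_rows = toy.loc[1:3, :]

# All columns from the row labelled 3 to the end of the DataFrame
tail_rows = toy.loc[3:, :]

print(middle_rows)
print(tail_rows)
```

Because `.loc[]` slices by label, the end label is part of the result, which differs from position-based slicing with `.iloc[]`.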

**3. Selecting by position**

Here "position" means the *numerical location*, i.e., row number and column number (both start from 0 per Python convention), in the DataFrame.

Use `iloc[]` to access part of the DataFrame by row and column numbers. Note the `[]` instead of `()`.

In [18]:
reviews.iloc[1, 2]

'Réserve'

In [19]:
reviews.iloc[:, 4]

0         5.0
1        12.0
2         9.0
3        29.0
4        40.0
         ... 
58482    35.0
58483     9.0
58484    15.0
58485    32.0
58486    10.0
Name: price, Length: 58487, dtype: float64

**4. Selecting by conditions**

This is also called **boolean indexing**, usually used to select *rows* satisfying certain conditions.
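
A minimal sketch of a single condition, using a made-up frame `toy` in place of `reviews` (the values are assumptions for illustration):

```python
import pandas as pd

# Illustrative stand-in for `reviews`; values are made up
toy = pd.DataFrame({'country': ['Australia', 'France', 'Spain', 'US', 'US'],
                    'points': [83, 85, 86, 87, 88]})

# The comparison yields a boolean Series: one True/False per row
is_us = toy['country'] == 'US'
print(is_us)

# Passing that Series to .loc keeps only the rows where it is True
us_rows = toy.loc[is_us]
print(us_rows)
```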

How can we satisfy more than one condition?

We still need `[]`, as we are passing conditions in place of a list of labels.

We can combine our conditions using `&` (element-wise "and").

We also need to wrap each condition in `()` because of operator precedence.

In Python, bitwise operators like `&` have a higher precedence than comparison operators like `==`.

Without parentheses, `&` would be applied first; the parentheses ensure each `==` is evaluated first and the results are then combined.
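
Combining two conditions might look like the sketch below. The frame `toy` and the threshold of 88 points are assumptions made for illustration.

```python
import pandas as pd

# Illustrative stand-in for `reviews`; values are made up
toy = pd.DataFrame({'country': ['Australia', 'France', 'Spain', 'US', 'US'],
                    'points': [83, 85, 86, 87, 88]})

# Each comparison is wrapped in () so it is evaluated before `&` combines them
us_high = toy.loc[(toy['country'] == 'US') & (toy['points'] >= 88)]
print(us_high)
```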

We can also match several values within a single column by passing a list to `.isin()`.
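
A short sketch of `.isin()` on a made-up frame (`toy` and the chosen countries are assumptions for illustration):

```python
import pandas as pd

# Illustrative stand-in for `reviews`; values are made up
toy = pd.DataFrame({'country': ['Australia', 'France', 'Spain', 'US', 'US'],
                    'points': [83, 85, 86, 87, 88]})

# True for every row whose country appears in the list
selected = toy.loc[toy['country'].isin(['France', 'Spain'])]
print(selected)
```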

Missing values can also be used as conditions, via `.isna()` and `.notna()`.
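
As a sketch, filtering on missing values could look like this. The frame `toy` is made up for illustration; `isna()` and `notna()` are the pandas methods that return boolean masks for missing and present values.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing price; values are made up
toy = pd.DataFrame({'country': ['Australia', 'France', 'US'],
                    'price': [5.0, np.nan, 29.0]})

# Rows where the price is missing
no_price = toy.loc[toy['price'].isna()]

# Rows where the price is present
has_price = toy.loc[toy['price'].notna()]

print(no_price)
print(has_price)
```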

#### Missing Data

Detect missing data (`np.nan`):
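
One possible way to detect missing values, sketched on a made-up frame (`toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column; values are made up
toy = pd.DataFrame({'country': ['Australia', None, 'US'],
                    'price': [5.0, np.nan, 29.0]})

# Boolean mask of the same shape: True where a value is missing
print(toy.isna())

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```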

Filling in missing data
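
A minimal sketch of filling gaps with `fillna()`. The frame `toy`, the placeholder `'Unknown'`, and the choice of the median as a fill value are all assumptions for illustration; the right fill value depends on the analysis.

```python
import numpy as np
import pandas as pd

# Illustrative frame; values are made up
toy = pd.DataFrame({'region_2': ['Sonoma', None, None],
                    'price': [35.0, np.nan, 15.0]})

# Replace missing text with a placeholder (an arbitrary choice)
toy['region_2'] = toy['region_2'].fillna('Unknown')

# Replace missing prices with the median of the observed prices
toy['price'] = toy['price'].fillna(toy['price'].median())

print(toy)
```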

Drop missing data
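
Dropping rows with `dropna()` can be sketched as below, again on a made-up frame (`toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative frame; values are made up
toy = pd.DataFrame({'country': ['Australia', None, 'US'],
                    'price': [5.0, 12.0, np.nan]})

# Drop every row that has a missing value in any column
complete_rows = toy.dropna()

# Drop rows only when `price` is missing
priced_rows = toy.dropna(subset=['price'])

print(complete_rows)
print(priced_rows)
```

Both calls return new DataFrames and leave `toy` unchanged.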

#### Exercise

Create a "sub"-DataFrame from `reviews` that contains the `country`, `province`, `region_1` and `region_2` columns with index labels `10`, `750` and `1200`.

In [21]:
print(reviews)

         country                                        description  \
0      Australia  Possibly a little sweet, this is a soft, easyg...   
1         France  A soft, almost off dry wine that is full in th...   
2          Spain  Generic white-fruit aromas of peach and apple ...   
3             US  This is the winery's best Nebula in years. Whi...   
4             US  This is a very rich Pinot whose primary virtue...   
...          ...                                                ...   
58482         US  A solid effort from a dependable winery that u...   
58483     Greece  Crushed thyme, pine resin and lemon start this...   
58484      Italy  Made from Negroamaro, this opens with aromas o...   
58485         US  This big, bold wine has the taste profile of a...   
58486      Spain  Zingy and sort of floral on the nose, but fair...   

                               designation  points  price         province  \
0                                      NaN      83    5.0  Australia 
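
One possible approach to this exercise is to pass a list of row labels and a list of column labels to `.loc[]`. The sketch below uses a hypothetical stand-in frame `toy` with made-up values, since only the index labels matter for the technique; with the real data, `toy` would be replaced by `reviews`.

```python
import pandas as pd

# Hypothetical stand-in for `reviews`; all values are made up
toy = pd.DataFrame({'country': ['US', 'Italy', 'France', 'Spain'],
                    'province': ['California', 'Tuscany', 'Bordeaux', 'Northern Spain'],
                    'region_1': ['Napa Valley', 'Chianti', 'Médoc', 'Rioja'],
                    'region_2': ['Napa', None, None, None],
                    'points': [90, 88, 91, 87]},
                   index=[10, 20, 750, 1200])

rows = [10, 750, 1200]
cols = ['country', 'province', 'region_1', 'region_2']

# A list of row labels plus a list of column labels selects the sub-DataFrame
sub = toy.loc[rows, cols]
print(sub)
```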

Create a "sub"-DataFrame from `reviews` that contains all reviews with at least 95 points for wines from Oceanian countries (Australia and New Zealand).

In [26]:
oceanic_wine = reviews.loc[(reviews['points'] >= 95) & (reviews.country.isin(["Australia", "New Zealand"]))]
oceanic_wine

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
576,Australia,This prodigious wine showcases Barossa's abili...,The Relic,98,135.0,South Australia,Barossa Valley,,Joe Czerwinski,@JoeCz,Standish 2006 The Relic Shiraz (Barossa Valley),Shiraz,Standish
1959,Australia,This Cabernet equivalent to Grange has explode...,Bin 707,95,500.0,South Australia,South Australia,,Joe Czerwinski,@JoeCz,Penfolds 2014 Bin 707 Cabernet Sauvignon (Sout...,Cabernet Sauvignon,Penfolds
10872,Australia,Full-bodied and plush yet vibrant and imbued w...,The Factor,98,125.0,South Australia,Barossa Valley,,Joe Czerwinski,@JoeCz,Torbreck 2013 The Factor Shiraz (Barossa Valley),Shiraz,Torbreck
11474,New Zealand,"This blend of Cabernet Sauvignon (62.5%), Merl...",SQM Gimblett Gravels Cabernets/Merlot,95,79.0,Hawke's Bay,,,Joe Czerwinski,@JoeCz,Squawking Magpie 2014 SQM Gimblett Gravels Cab...,Bordeaux-style Red Blend,Squawking Magpie
11762,Australia,The Factor is always one of Torbreck's biggest...,The Factor,95,125.0,South Australia,Barossa,,Joe Czerwinski,@JoeCz,Torbreck 2012 The Factor Shiraz (Barossa),Shiraz,Torbreck
14670,Australia,RWT (unromantically derived from “Red Wine Tri...,RWT,96,150.0,South Australia,Barossa Valley,,Joe Czerwinski,@JoeCz,Penfolds 2009 RWT Shiraz (Barossa Valley),Shiraz,Penfolds
20753,Australia,The fruit for this offering comes from the Gre...,R Reserve,95,105.0,South Australia,Barossa Valley,,Joe Czerwinski,@JoeCz,Kilikanoon 2009 R Reserve Shiraz (Barossa Valley),Shiraz,Kilikanoon
22515,Australia,The Taylor family selected Clare Valley for it...,St. Andrews Single Vineyard Release,95,60.0,South Australia,Clare Valley,,Joe Czerwinski,@JoeCz,Wakefield 2013 St. Andrews Single Vineyard Rel...,Shiraz,Wakefield
25802,Australia,"Seamless luxury from stem to stern, this ‘baby...",RWT,95,70.0,South Australia,Barossa Valley,,,,Penfolds 1998 RWT Shiraz (Barossa Valley),Shiraz,Penfolds
27293,Australia,Winemaker Dave Powell is no longer with Torbre...,RunRig,96,225.0,South Australia,Barossa Valley,,Joe Czerwinski,@JoeCz,Torbreck 2007 RunRig Shiraz-Viognier (Barossa ...,Shiraz-Viognier,Torbreck


### Summary Functions

Summary functions allow us to quickly describe and understand a dataset by computing key statistics. Common examples include `.mean()`, `.median()`, `.min()`, `.max()`, and `.sum()` for numerical data, as well as `.value_counts()` for categorical data. These functions can be applied to entire DataFrames or specific columns, giving us insights such as average values, distributions, and totals. Using `.describe()` provides a convenient overview of multiple summary statistics at once.


In [27]:
reviews.describe()

Unnamed: 0,points,price
count,58487.0,54404.0
mean,88.44244,35.537222
std,3.052034,42.727141
min,80.0,4.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,42.0
max,100.0,2500.0


In [28]:
reviews.country.unique()

array(['Australia', 'France', 'Spain', 'US', 'Italy', 'Portugal',
       'Germany', 'Chile', 'Argentina', 'South Africa', 'Georgia',
       'Austria', 'New Zealand', 'Uruguay', 'Turkey', 'Canada',
       'Bulgaria', 'Israel', 'Greece', 'Hungary', 'Ukraine', 'England',
       'Moldova', 'Croatia', nan, 'Mexico', 'Romania', 'Macedonia',
       'Morocco', 'Slovenia', 'Brazil', 'Lebanon', 'Luxembourg', 'Cyprus',
       'Peru', 'Czech Republic', 'Serbia', 'India', 'Armenia', 'Egypt',
       'Bosnia and Herzegovina', 'Switzerland'], dtype=object)

In [29]:
reviews.country.value_counts()

country
US                        24452
France                     9998
Italy                      8857
Spain                      2991
Portugal                   2620
Chile                      1981
Argentina                  1685
Austria                    1475
Australia                  1040
Germany                     980
New Zealand                 640
South Africa                616
Israel                      218
Greece                      213
Canada                      124
Hungary                      71
Bulgaria                     60
Romania                      52
Turkey                       49
Uruguay                      46
Georgia                      39
Croatia                      38
England                      33
Slovenia                     30
Mexico                       27
Moldova                      24
Brazil                       23
Lebanon                      21
Morocco                      14
Peru                          9
Cyprus                        5


For numerical columns, we can obtain the mean, median, min, max and sum. These are useful for obtaining quick statistics.

In [None]:
print("The mean is: ", reviews.price.mean())



The mean is:  35.537221527828834
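
The other summary functions follow the same pattern. The sketch below uses a small Series named `prices` as a stand-in for `reviews.price`; the five values are chosen for illustration only.

```python
import pandas as pd

# Small illustrative Series standing in for `reviews.price`
prices = pd.Series([5.0, 12.0, 9.0, 29.0, 40.0], name='price')

print("The mean is: ", prices.mean())
print("The median is: ", prices.median())
print("The min is: ", prices.min())
print("The max is: ", prices.max())
print("The sum is: ", prices.sum())
```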


## All Done!