# Introduction to Pandas

Pandas is a popular open-source data manipulation and analysis library for Python. It provides a fast and efficient way to work with structured data in various formats, including CSV, Excel, SQL databases, and more. In this Jupyter notebook, we will explore some of the key features and functionalities of pandas for data exploration and analysis.

# Loading Data


<img style="float: right;" src="./images/paul-carroll-Y-nyDv3TWm0-unsplash.jpg" width="33%">

# Palmer Penguins


This dataset contains information about penguins from three different species that were observed on the Palmer Archipelago, Antarctica. The data includes information about the penguins' species, island, bill length and depth, flipper length, body mass, and sex.

The data originally appeared in:

Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

<sub>Photo: Paul Carroll via Unsplash: https://unsplash.com/photos/Y-nyDv3TWm0 </sub>

In [3]:
import pandas as pd
import numpy as np

# Load the Palmer Penguins dataset
penguins = pd.read_csv('palmer_penguins.csv')
penguins.head()

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,4,Adelie,Torgersen,,,,,,2007
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


The penguins object is a pandas DataFrame, which is a two-dimensional labeled data structure with columns of potentially different data types.

If we wanted to load a tab-separated values file (tsv), we would need to specify the column delimiter (tab):

```python
penguins = pd.read_csv('palmer_penguins.tsv', sep='\t')
```
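As a minimal sketch of the `sep` parameter, the example below reads tab-separated text from an in-memory string instead of a file. The three-column sample is made up for illustration and is not the real dataset.

```python
import io

import pandas as pd

# Hypothetical in-memory TSV text standing in for a file on disk
tsv_text = "species\tisland\tbody_mass_g\nAdelie\tTorgersen\t3750\nGentoo\tBiscoe\t5700\n"

# sep="\t" tells read_csv to split columns on tab characters
toy = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(toy.shape)
print(toy.columns.tolist())
```

Passing a file path instead of the `io.StringIO` object works the same way.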

# Exploring the DataFrame

Once we have loaded data into a pandas DataFrame, we can check its contents. Here are some of the most commonly used pandas methods and attributes for exploring data:

### head() and tail()
The head() and tail() methods are used to display the first few and last few rows of a DataFrame, respectively. By default, they display the first and last 5 rows of the DataFrame.

In [4]:
# Display the first 10 rows of the DataFrame
# (a notebook cell only renders its last expression, so this result is not shown below)
penguins.head(10)

# Display the last 3 rows of the DataFrame
penguins.tail(3)


Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
341,342,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,343,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009
343,344,Chinstrap,Dream,50.2,18.7,198.0,3775.0,female,2009


### shape
The shape attribute returns a tuple with the number of rows and columns in the DataFrame.

In [5]:
# Display the shape of the DataFrame
penguins.shape

(344, 9)

### info()
The info() method displays information about the DataFrame, such as the number of non-null values in each column and the data type of each column.

*Hint: One thing to look out for here is the datatype of columns that contain numerical data: if these columns are shown as Dtype 'object', they may have been loaded with the numbers as strings.*


In [6]:
# Display information about the DataFrame
penguins.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
 8   year               344 non-null    int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 24.3+ KB
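
If a numeric column does turn out to have been loaded as text, one possible fix is `pd.to_numeric`. The sketch below uses a made-up column in which one entry is not a number at all; the values are illustrative only.

```python
import pandas as pd

# Made-up values: the numbers arrive as strings, and one entry is not numeric
toy = pd.DataFrame({"body_mass_g": ["3750", "3800", "unknown"]})

# errors="coerce" turns anything that cannot be parsed into NaN
toy["body_mass_g"] = pd.to_numeric(toy["body_mass_g"], errors="coerce")

print(toy["body_mass_g"].dtype)
print(toy["body_mass_g"].isna().sum())
```

Whether coercing unparseable entries to NaN is appropriate depends on the data; the default `errors="raise"` stops with an error instead.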


### describe()
The describe() method generates descriptive statistics of the DataFrame, such as the mean, standard deviation, and quartiles of each numerical column.

In [7]:
# Generate descriptive statistics of the DataFrame
penguins.describe()


Unnamed: 0,rowid,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,year
count,344.0,342.0,342.0,342.0,342.0,344.0
mean,172.5,43.92193,17.15117,200.915205,4201.754386,2008.02907
std,99.448479,5.459584,1.974793,14.061714,801.954536,0.818356
min,1.0,32.1,13.1,172.0,2700.0,2007.0
25%,86.75,39.225,15.6,190.0,3550.0,2007.0
50%,172.5,44.45,17.3,197.0,4050.0,2008.0
75%,258.25,48.5,18.7,213.0,4750.0,2009.0
max,344.0,59.6,21.5,231.0,6300.0,2009.0


### value_counts()

The value_counts() method creates a frequency table by counting the number of occurrences of each unique value in a column of the DataFrame. 

In [8]:
# Count the number of occurrences of each unique value in the 'species' column
penguins['species'].value_counts()

Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64
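
Two optional parameters of value_counts() are often useful: `normalize=True` returns proportions instead of counts, and `dropna=False` also counts missing values. A small sketch on a made-up Series:

```python
import pandas as pd

# Made-up values with one missing entry
toy = pd.Series(["male", "female", "female", None], name="sex")

# By default missing values are ignored; dropna=False counts them as well
counts_with_missing = toy.value_counts(dropna=False)

# normalize=True divides each count by the number of non-missing values
proportions = toy.value_counts(normalize=True)

print(counts_with_missing)
print(proportions)
```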

### crosstab()
The crosstab() function computes a contingency table (cross-tabulation) of two or more factors, showing how often each combination of values occurs.


In [9]:
# Create a contingency table of the 'species' and 'island' columns
pd.crosstab(penguins['species'], penguins['island'])

island,Biscoe,Dream,Torgersen
species,,,
Adelie,44,56,52
Chinstrap,0,68,0
Gentoo,124,0,0
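
Row and column totals can be added with the `margins` parameter. The following sketch uses a four-row made-up table rather than the full dataset:

```python
import pandas as pd

# Made-up observations for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "island": ["Dream", "Biscoe", "Biscoe", "Dream"],
})

# margins=True appends an "All" row and column holding the totals
table = pd.crosstab(toy["species"], toy["island"], margins=True)

print(table)
```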


# Selecting Data

In pandas, we can select specific rows and columns of a DataFrame using the following methods:

### Accessing individual columns of data

To access a single column in the dataframe, you can use the following syntax:

```python
df['column_name']
```

For example, to access the 'species' column in the Palmer Penguins dataset, you can use the following code:

```python
species = penguins['species']
print(species)
```
This will print the 'species' column of the dataframe. You can also access multiple columns by passing in a list of column names:

```python
columns = ['species', 'island', 'bill_length_mm']
subset = penguins[columns]
print(subset)
```
This will print a subset of the dataframe containing the 'species', 'island', and 'bill_length_mm' columns.



### Accessing individual Rows
#### loc[ ]

The loc[ ] indexer selects rows and columns by label. You can combine row labels (in the example below, index labels 0 to 4 are selected) with a list of column names. Unlike ordinary Python slices, a loc slice includes its end label.

In [10]:
# Select the first 5 rows and the 'species' and 'island' columns
penguins.loc[0:4, ['species', 'island']]


Unnamed: 0,species,island
0,Adelie,Torgersen
1,Adelie,Torgersen
2,Adelie,Torgersen
3,Adelie,Torgersen
4,Adelie,Torgersen


#### iloc[ ]
The iloc[ ] indexer selects rows and columns by integer position. Slices follow the usual Python convention and exclude the end position, so 0:5 selects positions 0 to 4.

In [11]:
# Select the first 5 rows and the first 3 columns
penguins.iloc[0:5, 0:3]


Unnamed: 0,rowid,species,island
0,1,Adelie,Torgersen
1,2,Adelie,Torgersen
2,3,Adelie,Torgersen
3,4,Adelie,Torgersen
4,5,Adelie,Torgersen
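
The difference between labels and positions is easiest to see side by side. The sketch below uses a small made-up DataFrame: the same slice bounds return different numbers of rows, and after filtering the labels no longer match the positions.

```python
import pandas as pd

# Made-up data with the default integer index 0..3
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, 3800.0, 5700.0, 5400.0],
})

# Label-based slice: the end label 2 is included, so this returns 3 rows
by_label = toy.loc[0:2, ["species"]]

# Position-based slice: the end position 2 is excluded, so this returns 2 rows
by_position = toy.iloc[0:2, 0:1]

# After filtering, the surviving rows keep their original labels
heavy = toy[toy["body_mass_g"] > 5000]
first_heavy_by_position = heavy.iloc[0]  # first remaining row
first_heavy_label = heavy.index[0]       # its original label

print(len(by_label), len(by_position), first_heavy_label)
```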


### Filtering Data

We can also filter the DataFrame to select rows based on certain conditions using boolean indexing.

In [12]:
# Filter the DataFrame to select rows where the 'body_mass_g' column is greater than 4000
penguins[penguins['body_mass_g'] > 4000]

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
7,8,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
9,10,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
14,15,Adelie,Torgersen,34.6,21.1,198.0,4400.0,male,2007
17,18,Adelie,Torgersen,42.5,20.7,197.0,4500.0,male,2007
19,20,Adelie,Torgersen,46.0,21.5,194.0,4200.0,male,2007
...,...,...,...,...,...,...,...,...,...
321,322,Chinstrap,Dream,50.8,18.5,201.0,4450.0,male,2009
323,324,Chinstrap,Dream,49.0,19.6,212.0,4300.0,male,2009
329,330,Chinstrap,Dream,50.7,19.7,203.0,4050.0,male,2009
333,334,Chinstrap,Dream,49.3,19.9,203.0,4050.0,male,2009
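
Several conditions can be combined in one boolean mask. The sketch below, on made-up data, uses `&` for "and", `~` for "not", and `isin()` for membership in a list; each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==`.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 5700.0, 3900.0, 4100.0],
})

# Both conditions must hold
heavy_gentoo = toy[(toy["body_mass_g"] > 4000) & (toy["species"] == "Gentoo")]

# Rows whose species is one of the listed values
adelie_or_chinstrap = toy[toy["species"].isin(["Adelie", "Chinstrap"])]

# ~ negates a condition
not_gentoo = toy[~(toy["species"] == "Gentoo")]

print(heavy_gentoo)
```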


### query()
The query() method filters the rows of a DataFrame using a query expression. The expression is a string that can contain column names, comparison operators, and logical operators.

In [13]:
# Filter the DataFrame to select rows where the 'body_mass_g' column is greater than 4000 and the 'species' column is equal to 'Gentoo'
penguins.query("body_mass_g > 4000 and species == 'Gentoo'")

Unnamed: 0,rowid,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
152,153,Gentoo,Biscoe,46.1,13.2,211.0,4500.0,female,2007
153,154,Gentoo,Biscoe,50.0,16.3,230.0,5700.0,male,2007
154,155,Gentoo,Biscoe,48.7,14.1,210.0,4450.0,female,2007
155,156,Gentoo,Biscoe,50.0,15.2,218.0,5700.0,male,2007
156,157,Gentoo,Biscoe,47.6,14.5,215.0,5400.0,male,2007
...,...,...,...,...,...,...,...,...,...
270,271,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,female,2009
272,273,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,female,2009
273,274,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,male,2009
274,275,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,female,2009


*n.b. When using the query() method, you can refer to Python variables in the query expression by prefixing the variable name with @. This distinguishes the variable from a column of the same name.*
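
A short sketch of the @ prefix, using made-up data and two hypothetical variables, `min_mass` and `target_species`:

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 5700.0, 3900.0, 4100.0],
})

# Ordinary Python variables (hypothetical names)
min_mass = 4000
target_species = "Gentoo"

# @min_mass and @target_species refer to the variables above,
# while body_mass_g and species refer to columns
result = toy.query("body_mass_g > @min_mass and species == @target_species")

print(result)
```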

# Manipulating Data

Once we have selected the data we need, we can start manipulating the data to prepare it for analysis. Here are some of the most commonly used pandas methods for data manipulation:

### dropna()
The dropna() method is used to remove rows or columns with missing values from the DataFrame.

In [14]:
# Remove rows with missing values from the DataFrame
penguins.dropna(inplace=True)


### fillna()
The fillna() method is used to fill missing values in the DataFrame with a specified value.

In [15]:
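# Fill missing values in the 'body_mass_g' column with the mean value
# (after the dropna() call above no missing values remain, so this changes nothing here)
mean_body_mass = penguins['body_mass_g'].mean()
penguins['body_mass_g'] = penguins['body_mass_g'].fillna(mean_body_mass)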
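
To see both methods act on data that still has gaps, here is a minimal sketch on a made-up DataFrame. It uses the `subset` parameter of dropna() to consider only one column, and assigns the result of fillna() back rather than using `inplace=True`.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in both columns
toy = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 5700.0],
    "sex": ["male", None, None],
})

# Drop only the rows where 'sex' is missing; other columns are not checked
with_sex = toy.dropna(subset=["sex"])

# Fill missing body masses with the column mean on a copy of the data
filled = toy.copy()
filled["body_mass_g"] = filled["body_mass_g"].fillna(filled["body_mass_g"].mean())

print(with_sex)
print(filled)
```

Both calls return new objects, so the original `toy` is left unchanged.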


### groupby()
The groupby() method is used to group the DataFrame by one or more columns and apply a function to each group. We can use it to calculate aggregate statistics for each group.



In [16]:
# Group the DataFrame by the 'species' column and calculate the mean value of the 'body_mass_g' column for each species
penguins.groupby('species')['body_mass_g'].mean()


species
Adelie       3706.164384
Chinstrap    3733.088235
Gentoo       5092.436975
Name: body_mass_g, dtype: float64
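
Several statistics can be computed per group in one call with agg(). A sketch on made-up values:

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 5400.0],
})

# One row per species, one column per requested statistic
summary = toy.groupby("species")["body_mass_g"].agg(["mean", "min", "max", "count"])

print(summary)
```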

# apply()

The .apply() method in pandas is a flexible tool for transforming data. Called on a DataFrame, it applies a function to each column (or to each row with axis=1); called on a Series, such as a single column, it applies the function to each value. Here we use the penguins dataset to illustrate it.

### Using .apply()
Suppose we want to create a new column in the penguins DataFrame called "body_mass_kg" that contains the body mass of each penguin in kilograms, instead of grams as in the original dataset. We can call .apply() on the body_mass_g column to run a conversion function on each value.



In [17]:
# Define a function to convert grams to kilograms
def grams_to_kg(x):
    return x / 1000

# Apply the function to the 'body_mass_g' column and store the result in a new column called 'body_mass_kg'
penguins['body_mass_kg'] = penguins['body_mass_g'].apply(grams_to_kg)

# Display the first few rows of the DataFrame
print(penguins.head())

   rowid species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0      1  Adelie  Torgersen            39.1           18.7              181.0   
1      2  Adelie  Torgersen            39.5           17.4              186.0   
2      3  Adelie  Torgersen            40.3           18.0              195.0   
4      5  Adelie  Torgersen            36.7           19.3              193.0   
5      6  Adelie  Torgersen            39.3           20.6              190.0   

   body_mass_g     sex  year  body_mass_kg  
0       3750.0    male  2007          3.75  
1       3800.0  female  2007          3.80  
2       3250.0  female  2007          3.25  
4       3450.0  female  2007          3.45  
5       3650.0    male  2007          3.65  


The apply() method applies the grams_to_kg() function to each value in the body_mass_g column and returns a new Series with the transformed data. We then store this new Series in a new column called "body_mass_kg" in the penguins DataFrame using the assignment operator (=).
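
If the function takes additional parameters, they can be passed through the `args` tuple of Series.apply(). The sketch below uses a hypothetical helper, `convert_mass`, on made-up values, and also shows that for simple arithmetic the same result can be obtained by operating on the whole column directly.

```python
import pandas as pd


def convert_mass(grams, divisor):
    """Hypothetical helper: divide a mass in grams by the given divisor."""
    return grams / divisor


# Made-up values for illustration
toy = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0]})

# Extra positional arguments for the function go in the args tuple
toy["body_mass_kg"] = toy["body_mass_g"].apply(convert_mass, args=(1000,))

# For simple arithmetic, dividing the whole column gives the same values
toy["body_mass_kg_vectorized"] = toy["body_mass_g"] / 1000

print(toy)
```

The column-wise version is usually shorter and faster; apply() is more useful when the transformation cannot be expressed as a column operation.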




# Using the unique method

The unique method can be used to identify unique values in a column of a Pandas DataFrame. Here's an example of how to use the unique method with the species column in the penguins dataset:

In [18]:
unique_species = penguins['species'].unique()

print(unique_species)


['Adelie' 'Gentoo' 'Chinstrap']


Note that the unique method returns the unique values in the order in which they appear in the original DataFrame. If you want the unique values sorted in alphabetical order, you can use the sort_values method before calling unique:

In [19]:
unique_species_sorted = penguins['species'].sort_values().unique()

print(unique_species_sorted)


['Adelie' 'Chinstrap' 'Gentoo']


# Using the tolist method

The tolist method can be used to convert a pandas Series to a Python list. This can be useful when you want to use data from a DataFrame in regular Python code.

Here's an example of how to use the tolist method with the island column in the penguins dataset:

In [20]:
island_list = penguins['island'].tolist()

print(island_list)

['Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', '

Note that the tolist method returns the values in the same order as they appear in the original DataFrame. If you want the values sorted in a particular order, you can use the sort_values method before calling tolist:

In [22]:
island_list_sorted = penguins['island'].sort_values().tolist()

print(island_list_sorted)


['Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe',