## Pandas

In [4]:
import pandas as pd

**pandas**

Pandas is a Python package for storing and manipulating tabular (2-dimensional) datasets.

**dataframe / DataFrame**

Pandas represents a dataset as a dataframe, an object of type DataFrame that consists of rows and columns.

**index / columns**

A dataframe's row labels are known as the index and column labels are known as the columns.
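As a minimal sketch of these two terms, the snippet below builds a small dataframe and reads its row and column labels through the `index` and `columns` attributes. The row labels `row_a` and `row_b` and the variable names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical two-row dataframe, used only for illustration
df = pd.DataFrame(
    data=[['abc', 3.3], ['xyz', -0.55]],
    columns=['Label1', 'Label2'],
    index=['row_a', 'row_b'])

row_labels = list(df.index)      # row labels (the index)
col_labels = list(df.columns)    # column labels (the columns)

print(row_labels)
print(col_labels)
```

If no `index` argument is passed, pandas assigns the default integer labels 0, 1, 2, and so on.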

### pandas data table representation

<img src="img/panda.png" width=400 height=400 />

In [2]:
# Load the pandas and numpy packages
import pandas as pd
import numpy as np

# Create a dataframe with pandas DataFrame() constructor
sample_df = pd.DataFrame(
   data=[ ['abc', 3.3, 28, True],
          ['xyz', -.55, 0, False] ], 
   columns=['Label1', 'Label2', 'Label3', 'Label4'],
   index=[0, 1])

# Output the dataframe
print(sample_df)

# Output the dataframe's shape
print(f'\nsample_df shape: {sample_df.shape}')

  Label1  Label2  Label3  Label4
0    abc    3.30      28    True
1    xyz   -0.55       0   False

sample_df shape: (2, 4)



- Selecting a single column of a pandas DataFrame returns a pandas Series.

In [7]:
sample_df["Label1"]

0    abc
1    xyz
Name: Label1, dtype: object

- You can create a Series from scratch as well:

In [8]:
ages = pd.Series([22, 35, 58], name="Age")

In [9]:
ages

0    22
1    35
2    58
Name: Age, dtype: int64

- A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.
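To make the row labels of a Series concrete, here is a small sketch. It rebuilds the `ages` Series so the snippet runs on its own; the custom labels `ann`, `bob`, and `cy` are hypothetical and serve only as an example.

```python
import pandas as pd

# Rebuild the ages Series so this snippet is self-contained
ages = pd.Series([22, 35, 58], name="Age")

print(ages.index)   # the row labels; a default integer range here
print(ages.name)    # the Series name, which becomes a column label in a DataFrame

# Hypothetical custom row labels, chosen only for illustration
named_ages = pd.Series([22, 35, 58], index=["ann", "bob", "cy"], name="Age")
print(named_ages["bob"])   # look up a value by its row label
```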

### Subsetting data

Subsetting data means selecting specific rows and columns from a dataframe by label, by integer position, or by slice (a range between two labels or positions).
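The difference between selecting by label and selecting by position is easiest to see when the row labels are not integers. The sketch below assumes a hypothetical three-row dataframe indexed by country name; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical dataframe with string row labels, for illustration only
df = pd.DataFrame(
    {'Continent': ['Asia', 'Europe', 'Africa'],
     'Population': [22720000, 3401200, 31471000]},
    index=['Afghanistan', 'Albania', 'Algeria'])

by_column = df['Population']                 # one column, selected by its label
by_label = df.loc['Albania', 'Continent']    # row label and column label
by_position = df.iloc[1, 0]                  # row position and column position

print(by_label, by_position)
```

Here `loc` and `iloc` reach the same cell, one through its labels and the other through its integer positions.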

In [4]:
%%writefile country_subset.csv

Name,Continent,Population
Afghanistan,Asia,22720000
Albania,Europe,3401200
Algeria,Africa,31471000
American Samoa,Oceania,68000
Andorra,Europe,78000
Angola,Africa,12878000
Anguilla,North America,8000
Antarctica,Antarctica,0
Antigua and Barbuda,North America,68000
Argentina,South America,37032000

Overwriting country_subset.csv


In [5]:
import pandas as pd

# Load the country_subset.csv data
country = pd.read_csv('country_subset.csv')

# Output the country dataframe
print('country dataframe:')
print(country)

country dataframe:
                  Name      Continent  Population
0          Afghanistan           Asia    22720000
1              Albania         Europe     3401200
2              Algeria         Africa    31471000
3       American Samoa        Oceania       68000
4              Andorra         Europe       78000
5               Angola         Africa    12878000
6             Anguilla  North America        8000
7           Antarctica     Antarctica           0
8  Antigua and Barbuda  North America       68000
9            Argentina  South America    37032000


In [6]:
# Select the 'Name' column and output the list of country names
print('List of country names:')
print(country['Name'], '\n')

List of country names:
0            Afghanistan
1                Albania
2                Algeria
3         American Samoa
4                Andorra
5                 Angola
6               Anguilla
7             Antarctica
8    Antigua and Barbuda
9              Argentina
Name: Name, dtype: object 



In [7]:
# Add an extra bracket to output the country names as a dataframe
print('Dataframe of country names:')
print(country[['Name']], '\n')

Dataframe of country names:
                  Name
0          Afghanistan
1              Albania
2              Algeria
3       American Samoa
4              Andorra
5               Angola
6             Anguilla
7           Antarctica
8  Antigua and Barbuda
9            Argentina 
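The two outputs above look similar but are different types. As a quick check, the sketch below compares them on a hypothetical two-row stand-in for the country dataframe; `single` and `double` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the country dataframe
country = pd.DataFrame({'Name': ['Afghanistan', 'Albania'],
                        'Continent': ['Asia', 'Europe']})

single = country['Name']      # single brackets return a Series
double = country[['Name']]    # a list of column names returns a DataFrame

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

The Series shape has one dimension, while the one-column DataFrame keeps two.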



In [8]:
# Select and output the 'Name' and 'Continent' columns
print('Name and Continent columns:')
print(country[['Name','Continent']], '\n')

Name and Continent columns:
                  Name      Continent
0          Afghanistan           Asia
1              Albania         Europe
2              Algeria         Africa
3       American Samoa        Oceania
4              Andorra         Europe
5               Angola         Africa
6             Anguilla  North America
7           Antarctica     Antarctica
8  Antigua and Barbuda  North America
9            Argentina  South America 



In [9]:
# Select the element in row 0 and column 1
print('Element in row 0 and column 1:')
print(country.iloc[0,1], '\n')

Element in row 0 and column 1:
Asia 



In [14]:
# Select and output rows 0 and 1 and column 1
print('Rows 0 and 1 and column 1:')
print(country.iloc[0:2,1], '\n')

Rows 0 and 1 and column 1:
0      Asia
1    Europe
Name: Continent, dtype: object 



In [17]:
# Select and output all rows before row 7 and columns 1 through 2
print('All rows before row 7 and columns 1 through 2:')
print(country.iloc[:7,1:3], '\n')

All rows before row 7 and columns 1 through 2:
       Continent  Population
0           Asia    22720000
1         Europe     3401200
2         Africa    31471000
3        Oceania       68000
4         Europe       78000
5         Africa    12878000
6  North America        8000 



In [18]:
# Select and output rows 2 through 9 and all columns from column 1 onwards using iloc
print('Rows 2 through 9 and all columns from column 1: ')
print(country.iloc[2:10,1:], '\n')

Rows 2 through 9 and all columns from column 1: 
       Continent  Population
2         Africa    31471000
3        Oceania       68000
4         Europe       78000
5         Africa    12878000
6  North America        8000
7     Antarctica           0
8  North America       68000
9  South America    37032000 



In [19]:
# Select and output rows 2 through 9 and the Continent and Population columns using loc
print('Rows 2 through 9 and the Continent and Population columns: ')
print(country.loc[2:9, ['Continent','Population']])

Rows 2 through 9 and the Continent and Population columns: 
       Continent  Population
2         Africa    31471000
3        Oceania       68000
4         Europe       78000
5         Africa    12878000
6  North America        8000
7     Antarctica           0
8  North America       68000
9  South America    37032000
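Note that the `iloc` example used `2:10` while the `loc` example used `2:9` to get the same rows. A minimal sketch of why, assuming a hypothetical five-row dataframe with the default integer index:

```python
import pandas as pd

# Hypothetical dataframe with the default integer index 0..4
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

by_position = df.iloc[2:4]   # positions 2 and 3; the end position is excluded
by_label = df.loc[2:4]       # labels 2, 3, and 4; the end label is included

print(len(by_position), len(by_label))
```

Position-based slices follow ordinary Python slicing, whereas label-based slices include both endpoints.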


In [21]:
country["Population"].max()

37032000

In [17]:
print(country)

                  Name      Continent  Population
0          Afghanistan           Asia    22720000
1              Albania         Europe     3401200
2              Algeria         Africa    31471000
3       American Samoa        Oceania       68000
4              Andorra         Europe       78000
5               Angola         Africa    12878000
6             Anguilla  North America        8000
7           Antarctica     Antarctica           0
8  Antigua and Barbuda  North America       68000
9            Argentina  South America    37032000


In [19]:
country['Continent'].value_counts()

Continent
Europe           2
Africa           2
North America    2
Asia             1
Oceania          1
Antarctica       1
South America    1
Name: count, dtype: int64