# Analysis and Visualization of Complex Agro-Environmental Data
---
## Introduction to Pandas
The name `pandas` is derived from "panel data". It is a package well suited to structured data, most typically relational tables: a set of columns (fields or variables) that describe a listing of rows (records or observations). In pandas (as in R), a data table is called a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame), which is defined as a two-dimensional, size-mutable (rows and/or columns may be added or deleted), potentially heterogeneous (columns may hold string, numeric, boolean and missing data) tabular data structure. Pandas offers a wide range of tools to explore, clean, and process your data.

The first step is to [install](https://pandas.pydata.org/docs/getting_started/install.html#) pandas. Then import pandas into your work session:

In [5]:
import pandas as pd

We may import tabular data from a *.csv file or a *.xlsx file directly into a DataFrame using, respectively, the `read_csv` and `read_excel` functions. Let's run an example with the table `penguins_lter.csv`, available in the "examples" folder of our GitHub repository (you may run `help(pd.read_csv)` to check the arguments).

In [8]:
df = pd.read_csv(r"/Users/eleshuk/Documents/GitHub/greends-avcad-2025/people/eleshuk/penguins_lter.csv", # the path to the file (the "r" prefix makes a raw string, so backslashes in Windows-style paths are not read as escape characters)
              sep=',', # the field separator used in the *.csv file
              header='infer', # automatically infers the header row
              index_col=None) # do not use any column as the row index
df.head()

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,11/11/07,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,11/11/07,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
3,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,11/16/07,,,,,,,,Adult not sampled.
4,PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,11/16/07,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,
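
The effect of the `sep`, `header` and `index_col` arguments can be tried without a file on disk. The sketch below is only an illustration: the semicolon-separated text, the column names and the name `demo` are invented for this example and are not part of the course repository.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for a file on disk
csv_text = """site;crop;yield_t_ha
A;wheat;3.2
B;maize;7.5
C;wheat;
"""

# sep=';' because this toy text is semicolon-delimited;
# header=0 takes the first line as the column names
demo = pd.read_csv(io.StringIO(csv_text), sep=";", header=0, index_col=None)

print(demo)        # the empty field in the last line is read as NaN
print(demo.shape)  # (number of rows, number of columns)
```

`read_csv` accepts a file path or any file-like object, which is why `io.StringIO` can replace the path here.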


In [9]:
type(df)

pandas.core.frame.DataFrame

In [10]:
len(df) # number of rows of the DataFrame

344

### Handling data with pandas

#### Accessing data attributes `dataframe.attribute`


In [11]:
df.values # returns the data as a two-dimensional NumPy array

array([['PAL0708', 1, 'Adelie Penguin (Pygoscelis adeliae)', ..., nan,
        nan, 'Not enough blood for isotopes.'],
       ['PAL0708', 2, 'Adelie Penguin (Pygoscelis adeliae)', ...,
        8.94956, -24.69454, nan],
       ['PAL0708', 3, 'Adelie Penguin (Pygoscelis adeliae)', ...,
        8.36821, -25.33302, nan],
       ...,
       ['PAL0910', 122, 'Gentoo penguin (Pygoscelis papua)', ...,
        8.30166, -26.04117, nan],
       ['PAL0910', 123, 'Gentoo penguin (Pygoscelis papua)', ...,
        8.24246, -26.11969, nan],
       ['PAL0910', 124, 'Gentoo penguin (Pygoscelis papua)', ..., 8.3639,
        -26.15531, nan]], dtype=object)

In [12]:
df.columns # returns an Index object with the header (variable) names.

Index(['studyName', 'Sample Number', 'Species', 'Region', 'Island', 'Stage',
       'Individual ID', 'Clutch Completion', 'Date Egg', 'Culmen Length (mm)',
       'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex',
       'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'],
      dtype='object')
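
A few other attributes are often useful alongside `values` and `columns`. The following is a minimal sketch on a three-row toy table (values copied from three rows of the penguins table; the name `toy` is an assumption made for this example).

```python
import pandas as pd

# Small stand-in table so the example runs without the CSV file
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap"],
    "Body Mass (g)": [3750.0, 4500.0, 3500.0],
    "Sample Number": [1, 1, 1],
})

print(toy.shape)          # tuple: (number of rows, number of columns)
print(toy.dtypes)         # data type of each column
print(toy.index)          # row labels (a RangeIndex by default)
print(list(toy.columns))  # column names as a plain Python list
```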

In [13]:
print(df.head(10)) # prints the first 10 rows.

  studyName  Sample Number                              Species  Region  \
0   PAL0708              1  Adelie Penguin (Pygoscelis adeliae)  Anvers   
1   PAL0708              2  Adelie Penguin (Pygoscelis adeliae)  Anvers   
2   PAL0708              3  Adelie Penguin (Pygoscelis adeliae)  Anvers   
3   PAL0708              4  Adelie Penguin (Pygoscelis adeliae)  Anvers   
4   PAL0708              5  Adelie Penguin (Pygoscelis adeliae)  Anvers   
5   PAL0708              6  Adelie Penguin (Pygoscelis adeliae)  Anvers   
6   PAL0708              7  Adelie Penguin (Pygoscelis adeliae)  Anvers   
7   PAL0708              8  Adelie Penguin (Pygoscelis adeliae)  Anvers   
8   PAL0708              9  Adelie Penguin (Pygoscelis adeliae)  Anvers   
9   PAL0708             10  Adelie Penguin (Pygoscelis adeliae)  Anvers   

      Island               Stage Individual ID Clutch Completion  Date Egg  \
0  Torgersen  Adult, 1 Egg Stage          N1A1               Yes  11/11/07   
1  Torgersen  Adul

#### Accessing and subsetting data

##### Variable subsetting

In [14]:
# extract 'Species' column from df
df['Species']

0      Adelie Penguin (Pygoscelis adeliae)
1      Adelie Penguin (Pygoscelis adeliae)
2      Adelie Penguin (Pygoscelis adeliae)
3      Adelie Penguin (Pygoscelis adeliae)
4      Adelie Penguin (Pygoscelis adeliae)
                      ...                 
339      Gentoo penguin (Pygoscelis papua)
340      Gentoo penguin (Pygoscelis papua)
341      Gentoo penguin (Pygoscelis papua)
342      Gentoo penguin (Pygoscelis papua)
343      Gentoo penguin (Pygoscelis papua)
Name: Species, Length: 344, dtype: object

In [15]:
# or
df.Species

0      Adelie Penguin (Pygoscelis adeliae)
1      Adelie Penguin (Pygoscelis adeliae)
2      Adelie Penguin (Pygoscelis adeliae)
3      Adelie Penguin (Pygoscelis adeliae)
4      Adelie Penguin (Pygoscelis adeliae)
                      ...                 
339      Gentoo penguin (Pygoscelis papua)
340      Gentoo penguin (Pygoscelis papua)
341      Gentoo penguin (Pygoscelis papua)
342      Gentoo penguin (Pygoscelis papua)
343      Gentoo penguin (Pygoscelis papua)
Name: Species, Length: 344, dtype: object

In [16]:
# Extract more than one column
df[['Species', 'Culmen Length (mm)']]

Unnamed: 0,Species,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),39.1
1,Adelie Penguin (Pygoscelis adeliae),39.5
2,Adelie Penguin (Pygoscelis adeliae),40.3
3,Adelie Penguin (Pygoscelis adeliae),
4,Adelie Penguin (Pygoscelis adeliae),36.7
...,...,...
339,Gentoo penguin (Pygoscelis papua),
340,Gentoo penguin (Pygoscelis papua),46.8
341,Gentoo penguin (Pygoscelis papua),50.4
342,Gentoo penguin (Pygoscelis papua),45.2


In [19]:
df["Culmen Length (mm)"]

0      39.1
1      39.5
2      40.3
3       NaN
4      36.7
       ... 
339     NaN
340    46.8
341    50.4
342    45.2
343    49.9
Name: Culmen Length (mm), Length: 344, dtype: float64
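
The selections above return two different kinds of object, depending on the brackets used. A small sketch, assuming a two-row toy table named `toy` (not the full penguins data):

```python
import pandas as pd

toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo"],
    "Culmen Length (mm)": [39.1, 46.1],
})

one_col = toy["Culmen Length (mm)"]       # a single name returns a Series
one_col_df = toy[["Culmen Length (mm)"]]  # a list of names returns a DataFrame

print(type(one_col).__name__)     # Series
print(type(one_col_df).__name__)  # DataFrame

# Dot access only works for names that are valid Python identifiers,
# so it is available for 'Species' but not for 'Culmen Length (mm)'
print(toy.Species.tolist())
```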

##### Row subsetting

In [20]:
df[df.Species == "Gentoo penguin (Pygoscelis papua)"] # extract rows containing "Gentoo penguin (Pygoscelis papua)" in 'Species'

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
220,PAL0708,1,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N31A1,Yes,11/27/07,46.1,13.2,211.0,4500.0,FEMALE,7.99300,-25.51390,
221,PAL0708,2,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N31A2,Yes,11/27/07,50.0,16.3,230.0,5700.0,MALE,8.14756,-25.39369,
222,PAL0708,3,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N32A1,Yes,11/27/07,48.7,14.1,210.0,4450.0,FEMALE,8.14705,-25.46172,
223,PAL0708,4,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N32A2,Yes,11/27/07,50.0,15.2,218.0,5700.0,MALE,8.25540,-25.40075,
224,PAL0708,5,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N33A1,Yes,11/18/07,47.6,14.5,215.0,5400.0,MALE,8.23450,-25.54456,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,PAL0910,120,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N38A2,No,12/1/09,,,,,,,,
340,PAL0910,121,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,11/22/09,46.8,14.3,215.0,4850.0,FEMALE,8.41151,-26.13832,
341,PAL0910,122,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,11/22/09,50.4,15.7,222.0,5750.0,MALE,8.30166,-26.04117,
342,PAL0910,123,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,11/22/09,45.2,14.8,212.0,5200.0,FEMALE,8.24246,-26.11969,


In [21]:
df[df.Species != "Adelie Penguin (Pygoscelis adeliae)"] # extract rows whose 'Species' is not "Adelie Penguin (Pygoscelis adeliae)"

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
152,PAL0708,1,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N61A1,No,11/19/07,46.5,17.9,192.0,3500.0,FEMALE,9.03935,-24.30229,
153,PAL0708,2,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N61A2,No,11/19/07,50.0,19.5,196.0,3900.0,MALE,8.92069,-24.23592,
154,PAL0708,3,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N62A1,Yes,11/26/07,51.3,19.2,193.0,3650.0,MALE,9.29078,-24.75570,
155,PAL0708,4,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N62A2,Yes,11/26/07,45.4,18.7,188.0,3525.0,FEMALE,8.64701,-24.62717,
156,PAL0708,5,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N64A1,Yes,11/21/07,52.7,19.8,197.0,3725.0,MALE,9.00642,-24.61867,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,PAL0910,120,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N38A2,No,12/1/09,,,,,,,,
340,PAL0910,121,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,11/22/09,46.8,14.3,215.0,4850.0,FEMALE,8.41151,-26.13832,
341,PAL0910,122,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,11/22/09,50.4,15.7,222.0,5750.0,MALE,8.30166,-26.04117,
342,PAL0910,123,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,11/22/09,45.2,14.8,212.0,5200.0,FEMALE,8.24246,-26.11969,


In [22]:
df[df['Culmen Length (mm)'] > 40] # extract all rows with Culmen Length > 40 mm

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
2,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
9,PAL0708,10,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A2,Yes,11/9/07,42.0,20.2,190.0,4250.0,,9.13362,-25.09368,No blood sample obtained for sexing.
12,PAL0708,13,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A1,Yes,11/15/07,41.1,17.6,182.0,3200.0,FEMALE,,,Not enough blood for isotopes.
17,PAL0708,18,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N9A2,Yes,11/12/07,42.5,20.7,197.0,4500.0,MALE,8.67538,-25.13993,
19,PAL0708,20,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N10A2,Yes,11/16/07,46.0,21.5,194.0,4200.0,MALE,9.11616,-24.77227,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
338,PAL0910,119,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N38A1,No,12/1/09,47.2,13.7,214.0,4925.0,FEMALE,7.99184,-26.20538,
340,PAL0910,121,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,11/22/09,46.8,14.3,215.0,4850.0,FEMALE,8.41151,-26.13832,
341,PAL0910,122,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,11/22/09,50.4,15.7,222.0,5750.0,MALE,8.30166,-26.04117,
342,PAL0910,123,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,11/22/09,45.2,14.8,212.0,5200.0,FEMALE,8.24246,-26.11969,


In [23]:
df.sample(n=5) # random sample of 5 rows

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
67,PAL0809,68,Adelie Penguin (Pygoscelis adeliae),Anvers,Biscoe,"Adult, 1 Egg Stage",N30A2,Yes,11/6/08,41.1,19.1,188.0,4100.0,MALE,8.71078,-25.81012,
167,PAL0708,16,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N70A2,Yes,11/22/07,50.5,19.6,201.0,4050.0,MALE,9.8059,-24.7294,
161,PAL0708,10,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N67A2,Yes,11/21/07,51.3,19.9,198.0,3700.0,MALE,8.79581,-24.36088,
193,PAL0809,42,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N74A2,Yes,11/24/08,46.2,17.5,187.0,3650.0,FEMALE,9.61734,-24.66188,
36,PAL0708,37,Adelie Penguin (Pygoscelis adeliae),Anvers,Dream,"Adult, 1 Egg Stage",N24A1,Yes,11/16/07,38.8,20.0,190.0,3950.0,MALE,9.18985,-25.12255,
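
The row filters above each use a single condition. Conditions can also be combined; the sketch below assumes a four-row toy table (`toy` and the result names are illustrative) and shows `&`, `isin` and a reproducible random sample.

```python
import pandas as pd

toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "Culmen Length (mm)": [39.1, 42.0, 46.1, 46.5],
    "Sex": ["MALE", None, "FEMALE", "FEMALE"],
})

# & means "and", | means "or", ~ means "not"; each condition needs its own parentheses
long_females = toy[(toy["Culmen Length (mm)"] > 40) & (toy["Sex"] == "FEMALE")]

# isin() tests membership in a list of values
adelie_or_gentoo = toy[toy["Species"].isin(["Adelie", "Gentoo"])]

# fixing random_state makes the random sample reproducible
same_every_time = toy.sample(n=2, random_state=0)

print(long_females)
print(adelie_or_gentoo)
```

Python's `and`/`or` keywords do not work element-wise on pandas objects, which is why the `&` and `|` operators are used instead.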


##### Row and column subsetting

Data subsetting in pandas is often based on the `.loc` and `.iloc` indexers.

* `.loc` selects by row and column labels (names): `.loc[row labels, column labels]`

* `.iloc` selects by integer position instead: `.iloc[row positions, column positions]`

In [24]:
df.loc[199,'Species'] # returns 'Species' at row with index = 199 (row 200)

'Chinstrap penguin (Pygoscelis antarctica)'

In [25]:
df.loc[0:4,'Species'] # returns 'Species' from rows 0 to 4 (4 is included, because .loc treats 0 and 4 as labels, not positions).

0    Adelie Penguin (Pygoscelis adeliae)
1    Adelie Penguin (Pygoscelis adeliae)
2    Adelie Penguin (Pygoscelis adeliae)
3    Adelie Penguin (Pygoscelis adeliae)
4    Adelie Penguin (Pygoscelis adeliae)
Name: Species, dtype: object

In [26]:
df.loc[:,['Species','Culmen Length (mm)']] # returns 'Species' and 'Culmen Length (mm)' for all rows

Unnamed: 0,Species,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),39.1
1,Adelie Penguin (Pygoscelis adeliae),39.5
2,Adelie Penguin (Pygoscelis adeliae),40.3
3,Adelie Penguin (Pygoscelis adeliae),
4,Adelie Penguin (Pygoscelis adeliae),36.7
...,...,...
339,Gentoo penguin (Pygoscelis papua),
340,Gentoo penguin (Pygoscelis papua),46.8
341,Gentoo penguin (Pygoscelis papua),50.4
342,Gentoo penguin (Pygoscelis papua),45.2


In [27]:
df.iloc[0:4, [2,9]] # returns 'Species' and 'Culmen Length (mm)' for rows 0 to 3.

Unnamed: 0,Species,Culmen Length (mm)
0,Adelie Penguin (Pygoscelis adeliae),39.1
1,Adelie Penguin (Pygoscelis adeliae),39.5
2,Adelie Penguin (Pygoscelis adeliae),40.3
3,Adelie Penguin (Pygoscelis adeliae),


In [28]:
df.iloc[2:4,:] # returns all columns of rows 2 and 3.

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
2,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
3,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,11/16/07,,,,,,,,Adult not sampled.
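
With the default integer index, labels and positions coincide, which can hide the difference between `.loc` and `.iloc`. The sketch below makes it visible by using text labels as the row index; the toy table and the choice of 'Individual ID' values as labels are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
     "Culmen Length (mm)": [39.1, 39.5, 46.1, 50.0]},
    index=["N1A1", "N1A2", "N31A1", "N31A2"],  # individual IDs used as row labels
)

by_label = toy.loc["N1A1":"N31A1", "Species"]  # label slice: the end label is included
by_position = toy.iloc[0:2, 0]                 # position slice: the end position is excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
```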


##### Summary functions

In [29]:
df['Culmen Length (mm)'].sum() # sums all the rows of the selected column (missing values are skipped)

np.float64(15021.3)

In [30]:
df.count() # counts the non-missing values in each column

studyName              344
Sample Number          344
Species                344
Region                 344
Island                 344
Stage                  344
Individual ID          344
Clutch Completion      344
Date Egg               344
Culmen Length (mm)     342
Culmen Depth (mm)      342
Flipper Length (mm)    342
Body Mass (g)          342
Sex                    334
Delta 15 N (o/oo)      330
Delta 13 C (o/oo)      331
Comments                26
dtype: int64

In [35]:
df.mean(numeric_only=True) # computes the mean value of only the quantitative variables (columns)

Sample Number            63.151163
Culmen Length (mm)       43.921930
Culmen Depth (mm)        17.151170
Flipper Length (mm)     200.915205
Body Mass (g)          4201.754386
Delta 15 N (o/oo)         8.733382
Delta 13 C (o/oo)       -25.686292
dtype: float64

In [36]:
df['Culmen Length (mm)'].mean() # computes the mean of the selected column

np.float64(43.9219298245614)

In [37]:
df.describe() # computes several statistics of only the quantitative variables (columns)

Unnamed: 0,Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Delta 15 N (o/oo),Delta 13 C (o/oo)
count,344.0,342.0,342.0,342.0,342.0,330.0,331.0
mean,63.151163,43.92193,17.15117,200.915205,4201.754386,8.733382,-25.686292
std,40.430199,5.459584,1.974793,14.061714,801.954536,0.55177,0.793961
min,1.0,32.1,13.1,172.0,2700.0,7.6322,-27.01854
25%,29.0,39.225,15.6,190.0,3550.0,8.29989,-26.320305
50%,58.0,44.45,17.3,197.0,4050.0,8.652405,-25.83352
75%,95.25,48.5,18.7,213.0,4750.0,9.172123,-25.06205
max,152.0,59.6,21.5,231.0,6300.0,10.02544,-23.78767


In [32]:
# Same thing but now using 'round' function
round(df.describe() ,2)

Unnamed: 0,Sample Number,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Delta 15 N (o/oo),Delta 13 C (o/oo)
count,344.0,342.0,342.0,342.0,342.0,330.0,331.0
mean,63.15,43.92,17.15,200.92,4201.75,8.73,-25.69
std,40.43,5.46,1.97,14.06,801.95,0.55,0.79
min,1.0,32.1,13.1,172.0,2700.0,7.63,-27.02
25%,29.0,39.22,15.6,190.0,3550.0,8.3,-26.32
50%,58.0,44.45,17.3,197.0,4050.0,8.65,-25.83
75%,95.25,48.5,18.7,213.0,4750.0,9.17,-25.06
max,152.0,59.6,21.5,231.0,6300.0,10.03,-23.79
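
The counts above differ between columns because summary functions skip missing values by default. A minimal sketch of that behaviour, using the first five 'Culmen Length (mm)' values of the table (the name `culmen` is chosen for this example):

```python
import numpy as np
import pandas as pd

culmen = pd.Series([39.1, 39.5, 40.3, np.nan, 36.7], name="Culmen Length (mm)")

print(len(culmen))                # 5 -> total number of rows
print(culmen.count())             # 4 -> non-missing values only
print(culmen.mean())              # mean of the 4 non-missing values
print(culmen.mean(skipna=False))  # nan -> a missing value propagates to the result
```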


##### Convert to other data formats

In [38]:
dflist = df.values.tolist() # converts DataFrame to a list of lists
print(*dflist, sep="\n") # the * unpacks dflist so that each inner list is passed to print separately; sep="\n" puts each one on its own line.

['PAL0708', 1, 'Adelie Penguin (Pygoscelis adeliae)', 'Anvers', 'Torgersen', 'Adult, 1 Egg Stage', 'N1A1', 'Yes', '11/11/07', 39.1, 18.7, 181.0, 3750.0, 'MALE', nan, nan, 'Not enough blood for isotopes.']
['PAL0708', 2, 'Adelie Penguin (Pygoscelis adeliae)', 'Anvers', 'Torgersen', 'Adult, 1 Egg Stage', 'N1A2', 'Yes', '11/11/07', 39.5, 17.4, 186.0, 3800.0, 'FEMALE', 8.94956, -24.69454, nan]
['PAL0708', 3, 'Adelie Penguin (Pygoscelis adeliae)', 'Anvers', 'Torgersen', 'Adult, 1 Egg Stage', 'N2A1', 'Yes', '11/16/07', 40.3, 18.0, 195.0, 3250.0, 'FEMALE', 8.36821, -25.33302, nan]
['PAL0708', 4, 'Adelie Penguin (Pygoscelis adeliae)', 'Anvers', 'Torgersen', 'Adult, 1 Egg Stage', 'N2A2', 'Yes', '11/16/07', nan, nan, nan, nan, nan, nan, nan, 'Adult not sampled.']
['PAL0708', 5, 'Adelie Penguin (Pygoscelis adeliae)', 'Anvers', 'Torgersen', 'Adult, 1 Egg Stage', 'N3A1', 'Yes', '11/16/07', 36.7, 19.3, 193.0, 3450.0, 'FEMALE', 8.76651, -25.32426, nan]
['PAL0708', 6, 'Adelie Penguin (Pygoscelis adeli

In [39]:

species = df['Species'].values.tolist() # converts the variable "Species" into a list.
print(species) 

['Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygosce

In [40]:
# or
print(*species, sep="\n")

Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis adeliae)
Adelie Penguin (Pygoscelis a
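
Besides lists, a DataFrame can be exported to NumPy arrays and dictionaries. A short sketch on a two-row toy table (the names `toy`, `as_array`, `as_records` and `as_columns` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo"],
    "Body Mass (g)": [3750.0, 4500.0],
})

as_array = toy.to_numpy()                   # NumPy array with the same content as toy.values
as_records = toy.to_dict(orient="records")  # one dictionary per row
as_columns = toy.to_dict(orient="list")     # one list per column

print(as_records)
print(as_columns)
```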

##### Creating DataFrames

In [41]:
# Create lists
list1 = ['Cropland', 'Forest', 'Grassland ', 'Urban']
list2 = [60 , 20, 5, 10]

In [42]:
# Defining each list as columns
data = list(zip(list1, list2)) # list of tuples
newdf = pd.DataFrame(data=data, columns=['LULC', '2010'])
print(newdf)

         LULC  2010
0    Cropland    60
1      Forest    20
2  Grassland      5
3       Urban    10
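
The same kind of table can also be built from a dictionary, where each key becomes a column name. This is a sketch of an alternative to `zip`, not a replacement for the cell above; the name `lulc` is chosen for this example.

```python
import pandas as pd

# Each key is a column name, each value is the list of values for that column
lulc = pd.DataFrame({
    "LULC": ["Cropland", "Forest", "Grassland", "Urban"],
    "2010": [60, 20, 5, 10],
})

print(lulc)
```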


In [43]:
# Defining each list as rows
newdf2 = pd.DataFrame([list1, list2], index=['LULC', '2010'])
print(newdf2)

             0       1           2      3
LULC  Cropland  Forest  Grassland   Urban
2010        60      20           5     10


In [44]:
# transpose DataFrames
print(newdf2.T)

         LULC 2010
0    Cropland   60
1      Forest   20
2  Grassland     5
3       Urban   10


In [45]:
# set the first row as the header names
newdf2.columns = newdf2.iloc[0]
newdf2 = newdf2[1:] # drop the first row, which is now duplicated in the header
print(newdf2)

# alternatively, in one step (instead of the two lines above):
# newdf2 = newdf2.rename(columns=newdf2.iloc[0]).drop(newdf2.index[0])
# print(newdf2)


LULC Cropland Forest Grassland  Urban
2010       60     20          5    10


##### Adding and combining

In [46]:
# Adding rows
newdf2.loc[2] = [55, 15, 10, 15] # adds a new row with index label 2
newdf2.rename(index={2: '2020'}, inplace=True) # replaces the index label 2 with '2020'
print(newdf2)

LULC Cropland Forest Grassland  Urban
2010       60     20          5    10
2020       55     15         10    15


In [47]:
# Adding columns
newdf['2020'] = [55, 15, 10, 15]
print(newdf)

         LULC  2010  2020
0    Cropland    60    55
1      Forest    20    15
2  Grassland      5    10
3       Urban    10    15
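
New columns can also be computed from existing ones, and rows stored in another DataFrame can be appended with `pd.concat`. The sketch below assumes a reduced two-class table; the column name `change` and the names `lulc`, `extra` and `combined` are introduced only for this example.

```python
import pandas as pd

lulc = pd.DataFrame({
    "LULC": ["Cropland", "Forest"],
    "2010": [60, 20],
    "2020": [55, 15],
})

# A new column computed element-wise from two existing columns
lulc["change"] = lulc["2020"] - lulc["2010"]

# Rows held in another DataFrame are appended with pd.concat
extra = pd.DataFrame({"LULC": ["Urban"], "2010": [10], "2020": [15], "change": [5]})
combined = pd.concat([lulc, extra], ignore_index=True)  # renumber the rows from 0

print(combined)
```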


In [48]:
# merging DataFrames based on common fields

list3 = ['Cropland', 'Forest', 'Grassland ', 'Water']
list4 = [55 , 15, 10, 2]

data2 = list(zip(list3, list4))
newdf3 = pd.DataFrame(data=data2 , columns =['LULC', '2030']) # produce new DataFrame



In [49]:
# left join: keeps all rows from newdf (left) and adds matching values from newdf3
pd.merge(newdf, newdf3, how='left', on='LULC') 

Unnamed: 0,LULC,2010,2020,2030
0,Cropland,60,55,55.0
1,Forest,20,15,15.0
2,Grassland,5,10,10.0
3,Urban,10,15,


In [50]:
# right join: keeps all rows from newdf3 (right) and adds matching values from newdf
pd.merge(newdf, newdf3, how='right', on='LULC') 

Unnamed: 0,LULC,2010,2020,2030
0,Cropland,60.0,55.0,55
1,Forest,20.0,15.0,15
2,Grassland,5.0,10.0,10
3,Water,,,2


In [51]:
# inner join: keeps only the rows whose 'LULC' value appears in both DataFrames
pd.merge(newdf, newdf3, how='inner', on='LULC') 

Unnamed: 0,LULC,2010,2020,2030
0,Cropland,60,55,55
1,Forest,20,15,15
2,Grassland,5,10,10


In [52]:
# outer join: keeps all rows from both DataFrames
pd.merge(newdf, newdf3, how='outer', on='LULC')

Unnamed: 0,LULC,2010,2020,2030
0,Cropland,60.0,55.0,55.0
1,Forest,20.0,15.0,15.0
2,Grassland,5.0,10.0,10.0
3,Urban,10.0,15.0,
4,Water,,,2.0
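
To check where each row of a merged table came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch with two reduced toy tables (`left`, `right` and `merged` are names assumed for this example):

```python
import pandas as pd

left = pd.DataFrame({"LULC": ["Cropland", "Forest", "Urban"], "2010": [60, 20, 10]})
right = pd.DataFrame({"LULC": ["Cropland", "Forest", "Water"], "2030": [55, 15, 2]})

# _merge takes the values 'both', 'left_only' or 'right_only'
merged = pd.merge(left, right, how="outer", on="LULC", indicator=True)

print(merged[["LULC", "_merge"]])
```

Here 'Urban' is flagged as `left_only` and 'Water' as `right_only`, which explains the missing values in the outer join.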


## References

pandas documentation, Version 2.2.3, 2024. https://pandas.pydata.org/docs/index.html

pandas cheat sheet. https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf