# Analysis and Visualization of Complex Agro-Environmental Data
---
## Descriptive statistics

As an example, we will work on a subset of a database that resulted from integrating information from several river fish biomonitoring programmes across Europe. This subset includes data for some Mediterranean countries. Each case (row) corresponds to a fish sampling point. Variables (columns) include coordinates, country and catchment identifiers, local-scale environmental variables, climatic variables, human pressures and fish presence/absence data.

When working with a new dataset, one of the most useful first steps is to visualize the data. Tables, histograms, box plots and other visual tools give a better idea of what the data may be telling us, and can reveal patterns that we might not have discovered otherwise.

We will be going over how to perform some basic visualisations in Python, and, most importantly, we will learn how to begin exploring data from a graphical perspective.

In [1]:
import pandas as pd
import zipfile
import seaborn as sns # For plotting
import matplotlib.pyplot as plt # For showing plots

#### Import, visualize and summarize table properties

In [2]:
df = pd.read_csv('EFIplus_medit.zip',compression='zip', sep=";")
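The `compression='zip'` and `sep=";"` arguments used above can be tried out without the real file. The following is a minimal sketch that writes a tiny semicolon-separated CSV into a temporary zip archive and reads it back; the file name `toy_sites.zip` and its three columns are invented for illustration and are not part of the EFI+ dataset.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical toy content: semicolon-separated, like the course file.
csv_text = (
    "Site_code;Country;prec_ann_catch\n"
    "S1;Spain;610.5\n"
    "S2;Portugal;890.0\n"
)

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "toy_sites.zip")
    # The archive must contain exactly one file for pandas to read it directly.
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("toy_sites.csv", csv_text)
    toy = pd.read_csv(zip_path, compression="zip", sep=";")

print(toy.shape)
print(toy.dtypes)
```

Without `sep=";"` pandas would look for commas and return a single wide column, so it is worth checking the shape right after loading.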

In [None]:
print(df)

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
list(df.columns)

#### Clean and readjust the dataset

In [4]:
# clean up the dataset to remove unnecessary columns (e.g. REG)
# columns at positions 5 to 14 are selected by label and dropped
df.drop(columns=df.columns[5:15], inplace=True)

# let's rename some columns so that they make sense
df.rename(columns={'Sum of Run1_number_all':'Total_fish_individuals'}, inplace=True) # inplace=True means that df itself is updated

# for sake of consistency, let's also make all column labels of type string
df.columns = list(map(str, df.columns))
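To see what dropping by position and renaming do without touching the real table, here is a small sketch on a toy DataFrame. The toy values are invented; only the column names `REG` and `Sum of Run1_number_all` echo the cell above.

```python
import pandas as pd

# Toy table, values invented for illustration only.
toy = pd.DataFrame({
    "Site_code": ["S1", "S2"],
    "REG": ["A", "B"],
    "Sum of Run1_number_all": [12, 7],
})

# columns[1:2] selects the label at position 1 only (the slice end is exclusive).
cleaned = toy.drop(columns=toy.columns[1:2])

# Without inplace=True, rename returns a new DataFrame and leaves the input as is.
cleaned = cleaned.rename(
    columns={"Sum of Run1_number_all": "Total_fish_individuals"}
)

print(list(toy.columns))      # original is unchanged
print(list(cleaned.columns))
```

Assigning the result to a new name, as done here, keeps the original table available for comparison; `inplace=True` overwrites it instead.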

In [None]:
# Check data types
pd.options.display.max_rows = 154 # maximum number of rows displayed.
df.dtypes

In [None]:
# Number of values per variable
df.count()

### Handling missing data

In [None]:
# Number of missing values (NaN) per variable
df.isnull().sum()

In [None]:
df2 = df.dropna(how='all') # drops rows when all elements are missing values
df2.info()

In [None]:
df2 = df.dropna(how='all', axis=1) # drops columns when all elements are missing values
df2.info()

In [6]:
df2 = df.dropna() # drops rows when at least one element is a missing value
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2822 entries, 18 to 5010
Columns: 154 entries, Site_code to Iberochondrostoma_sp
dtypes: float64(35), int64(113), object(6)
memory usage: 3.3+ MB
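The three `dropna` calls above behave quite differently. A minimal sketch on an invented 3x3 table makes the contrast visible:

```python
import numpy as np
import pandas as pd

# Toy table: row 2 is entirely missing, column "c" is entirely missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

rows_all = toy.dropna(how="all")          # removes only the fully empty row
cols_all = toy.dropna(how="all", axis=1)  # removes only the fully empty column
rows_any = toy.dropna()                   # removes every row holding any NaN

print(rows_all.shape, cols_all.shape, rows_any.shape)

# Dropping the empty column first preserves rows that are otherwise complete.
print(cols_all.dropna().shape)
```

The default `dropna()` is the most aggressive option: a single empty column is enough to discard every row, which is why it is worth checking how many cases survive.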


### Numerical summaries

In [None]:
mean = df['prec_ann_catch'].mean()
median = df['prec_ann_catch'].median()
print(mean, median)

In [17]:
print(df['Catchment_name'].mode())

0    Ebro
Name: Catchment_name, dtype: object
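As a quick illustration of why it helps to look at the mean and the median together, the sketch below uses invented numbers with one extreme value. It also shows that `mode()` returns a Series, because several values can tie.

```python
import pandas as pd

# Invented precipitation-like values with one extreme observation.
rain = pd.Series([400, 450, 500, 550, 3000])
mean = rain.mean()      # pulled upwards by the extreme value
median = rain.median()  # middle value, unaffected by the extreme
print(mean, median)

# Invented catchment labels with a tie between two names.
catch = pd.Series(["Ebro", "Tejo", "Ebro", "Tejo", "Douro"])
modes = catch.mode()    # a Series holding every tied value, sorted
print(list(modes))
```

A mean far above the median, as here, usually points to a right-skewed distribution, which a histogram or box plot would confirm.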


In [None]:
# A fast way of getting summary statistics of quantitative data (int or float)
df.describe() # before dropping NaNs

In [None]:
df2.describe() # after dropping NaNs

In [None]:
country_count = pd.crosstab(index = df['Country'], columns='count')
print(country_count)

In [None]:
catchment_count = pd.crosstab(index = df['Catchment_name'], columns='count')
print(catchment_count)
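A one-way `pd.crosstab` like the ones above is a frequency table. The sketch below, on invented country labels, compares it with `value_counts()` and adds relative frequencies through the `normalize` argument.

```python
import pandas as pd

# Invented labels, for illustration only.
toy = pd.DataFrame({"Country": ["Spain", "Portugal", "Spain", "Italy", "Spain"]})

# One-way frequency table: a DataFrame with a single column named "count".
counts_tab = pd.crosstab(index=toy["Country"], columns="count")

# Same counts as a Series; sort_index() puts them in the crosstab's order.
counts_vc = toy["Country"].value_counts().sort_index()

# Relative frequencies (fractions of all rows).
freq_tab = pd.crosstab(index=toy["Country"], columns="count", normalize=True)

print(counts_tab)
print(counts_vc)
print(freq_tab)
```

Note that `crosstab` orders the categories alphabetically, while `value_counts()` orders them by frequency unless the index is sorted.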

### Plotting qualitative data

Check here: https://seaborn.pydata.org/generated/seaborn.catplot.html

In [None]:
country_count.plot(kind='bar') # pandas function

In [None]:
catchment_count.plot(kind='bar') # pandas function

In [None]:
sns.catplot(x="Country", data=df, kind="count", color="skyblue")

In [None]:
sns.catplot(x="Country", data=df2, kind="count", color="skyblue")

In [None]:
sns.catplot(x="Catchment_name", data=df, kind="count", color="skyblue")
plt.xticks(rotation=90)
plt.show()

In [None]:
sns.catplot(x="Catchment_name", data=df2, kind="count", color="skyblue")
plt.xticks(rotation=90)
plt.show()

In [None]:

colors = sns.color_palette('pastel')
labels = list(country_count.index) # take labels from the table so they always match the counts
plt.pie(list(country_count.iloc[:,0]), labels=labels, colors = colors, autopct = '%0.0f%%')
plt.show()

### Plotting quantitative data

#### Strip plots

Check here: https://seaborn.pydata.org/generated/seaborn.stripplot.html

In [None]:
sns.stripplot(data=df2, y='prec_ann_catch')
plt.show()

#### Histograms

Check here: https://seaborn.pydata.org/generated/seaborn.histplot.html

In [None]:
sns.histplot(df2['prec_ann_catch'], kde = False).set_title("Histogram of precipitation in the upstream catchment")
plt.show()

In [None]:
sns.histplot(
    df["prec_ann_catch"], 
    kde=True,
    stat="density",
    kde_kws=dict(cut=3),
    alpha=.4,
    edgecolor=(1, 1, 1, 0.4),
).set_title("Histogram of precipitation in the upstream catchment")
plt.show()
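With `stat="density"` the bar heights are rescaled so that the total area of the histogram equals 1, which puts the bars on the same scale as the KDE curve. The sketch below checks this idea with `numpy.histogram` on invented values; it illustrates the normalisation and is not seaborn's internal code.

```python
import numpy as np

# Invented precipitation-like values.
values = np.array([410.0, 480, 520, 560, 610, 700, 820, 950, 1200, 1800])

# Raw counts per bin.
counts, edges = np.histogram(values, bins=4)

# Density: count / (n * bin width), so that the bar areas sum to 1.
density, _ = np.histogram(values, bins=4, density=True)
widths = np.diff(edges)
area = float(np.sum(density * widths))

print(counts, counts.sum())
print(density)
print(area)
```

Because the heights depend on the bin width, density values are not probabilities and can exceed 1 when the bins are narrow.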

In [None]:
df_port = df[df['Country']=='Portugal']

sns.histplot(
    df_port["prec_ann_catch"], 
    kde=True,
    stat="density",
    kde_kws=dict(cut=3),
    alpha=.4,
    edgecolor=(1, 1, 1, 0.4),
).set_title("Histogram of precipitation in the upstream catchment (Portugal)")
plt.show()


#### Boxplots

Check here: https://seaborn.pydata.org/generated/seaborn.boxplot.html

In [None]:
sns.boxplot(df["prec_ann_catch"]).set_title("Box plot of Total Annual Precipitation")
plt.show()

In [None]:
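# whis=0 gives whiskers of zero length, so every point outside the box (below Q1 or above Q3) is drawn as an outlier
sns.boxplot(df["prec_ann_catch"], whis=0).set_title("Box plot of Total Annual Precipitation")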
plt.show()
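The `whis` argument sets how far the whiskers may reach beyond the box, as a multiple of the interquartile range (IQR); the usual default is 1.5. The sketch below computes these limits by hand on invented values. The helper `box_stats` is hypothetical, written only for this illustration, and uses NumPy's default linear-interpolation quartiles.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker limits and outliers for a 1-D array (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_limit = q1 - whis * iqr
    upper_limit = q3 + whis * iqr
    # Points beyond the limits are the ones plotted individually as outliers.
    outliers = values[(values < lower_limit) | (values > upper_limit)]
    return {"q1": q1, "q3": q3, "iqr": iqr,
            "limits": (lower_limit, upper_limit), "outliers": outliers}

# Invented values with one extreme observation.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

default_stats = box_stats(data)        # whis=1.5
zero_stats = box_stats(data, whis=0)   # limits collapse onto Q1 and Q3

print(default_stats["limits"], default_stats["outliers"])
print(zero_stats["limits"], zero_stats["outliers"])
```

In the drawn plot each whisker stops at the most extreme data point that still lies inside its limit, so the whisker ends are usually shorter than the limits computed here.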

In [None]:
sns.boxplot(x="Country", y="prec_ann_catch", data=df).set_title("Box plot of Total Annual Precipitation")
plt.show()

#### Violin plots

Check here: https://seaborn.pydata.org/generated/seaborn.violinplot.html

In [None]:
sns.violinplot(x="Country", y="prec_ann_catch", data=df).set_title("Violin plot of Total Annual Precipitation")
plt.show()

#### Bar plots

Check here: https://seaborn.pydata.org/generated/seaborn.barplot.html

In [None]:
sns.barplot(x="Country", y="prec_ann_catch", data=df)
plt.show()

In [None]:
sns.barplot(data=df, x="Catchment_name", y="prec_ann_catch", estimator="mean")
plt.xticks(rotation=90)
plt.show()
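The height of each bar drawn by `sns.barplot` is the value of the estimator (the mean, by default) within each category. As a rough check, the same numbers can be obtained with a pandas `groupby`; the sketch below uses invented values and only reuses the column names from the cells above.

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame({
    "Country": ["Spain", "Spain", "Portugal", "Portugal", "Portugal"],
    "prec_ann_catch": [500.0, 700.0, 800.0, 1000.0, 1200.0],
})

# Mean per category: these are the bar heights for estimator="mean".
means = toy.groupby("Country")["prec_ann_catch"].mean()

# Swapping the estimator for the median corresponds to estimator="median".
medians = toy.groupby("Country")["prec_ann_catch"].median()

print(means)
print(medians)
```

A bar shows only one summary value per group (plus an error bar in recent seaborn versions), so it hides the spread that box and violin plots reveal.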