<a href="https://colab.research.google.com/github/mfernandes61/py-dropin-session/blob/main/Carp_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Carpentries Python course as a notebook.  
Created as a teaching aid by Mark Fernandes.  
University of Cambridge.  


This notebook reviews useful concepts from the Carpentries course materials.   


In the next cell, a leading **!** runs an external (shell) command from a code cell, for example to install a required Python package.   

In [None]:
!pip install plotnine

Storing simple values in variables of Python's basic data types.   

In [None]:
text = "Data Carpentry"  # An example of assigning a value to a new text variable,
                         # also known as a string data type in Python
number = 42              # An example of assigning a numeric value, or an integer data type
pi_value = 3.1415        # An example of assigning a floating point value (the float data type)

In [None]:
type(text)

In [None]:
type(number)

In [None]:
print(text)

In [None]:
text

Carrying out arithmetic and performing comparisons.   

In [None]:
6 * 7  # Multiplication

In [None]:
3 > 4 # Logical test

## Moving on to more sophisticated (and useful) Python data structures.   

In [None]:
# Lists
numbers = [1, 2, 3]
numbers[0]

In [None]:
# for loop & list
for num in numbers:
    print(num)


In [None]:
# Append to list
numbers.append(4)
print(numbers)

In [None]:
# Tuples use parentheses
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')

# Note: lists use square brackets
a_list = [1, 2, 3]

# length function
len(a_tuple)

**Key Differences between List and Tuple**

| # | List | Tuple |
|---|------|-------|
| 1 | Lists are mutable (can be modified). | Tuples are immutable (cannot be modified). |
| 2 | Iteration over a list can be marginally slower. | Iteration over a tuple can be marginally faster, although the difference is usually negligible in practice. |
| 3 | Lists are better suited to operations that change the contents, such as insertion and deletion. | Tuples are better suited to fixed collections that are only read. |
| 4 | Lists consume more memory. | Tuples consume less memory. |
| 5 | Lists have several built-in methods. | Tuples have fewer built-in methods. |
| 6 | Lists are more prone to unexpected changes and errors. | Tuples, being immutable, are less error-prone. |
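
The mutability difference in row 1 can be checked directly. The following is a minimal sketch; the variable names and values are made up for illustration.

```python
# A list can be changed in place after it is created
a_list = [1, 2, 3]
a_list.append(4)   # add an item at the end
a_list[0] = 10     # overwrite the first item
print(a_list)

# A tuple cannot: item assignment raises a TypeError
a_tuple = (1, 2, 3)
try:
    a_tuple[0] = 10
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print('Tuples are immutable:', err)

print(a_tuple)  # unchanged
```

Because a tuple cannot be altered after creation, it is a reasonable choice for values that should stay fixed, such as a pair of coordinates.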


In [None]:
# Dictionaries
translation = {'one': 'first', 'two': 'second'}
translation['one']

In [None]:
rev = {'first': 'one', 'second': 'two'}
print(rev['first'])

# add to dictionary
rev['third'] = 'three'
rev

In [None]:
# Loop through dictionaries - method 1
print('Method 1')
for key, value in rev.items():
    print(key, '->', value)

# Loop through dictionaries - method 2
print('Method 2')
for key in rev.keys():
    print(key, '->', rev[key])

In [None]:
# Example of function
def add_function(a, b):
    result = a + b
    return result

z = add_function(20, 22)
print(z)

**Carpentries data analysis example**.   
Creating the project directory structure and getting the data:   


In [None]:
# -p avoids an error if the directory already exists (e.g. when re-running the cell)
!mkdir -p data
!mkdir -p data_output

In [None]:
!wget -O data/surveys.csv https://ndownloader.figshare.com/files/2292172
!wget -O data/species.csv https://raw.githubusercontent.com/datacarpentry/python-ecology-lesson/main/episodes/data/species.csv
!wget -O data/speciesSubset.csv https://raw.githubusercontent.com/datacarpentry/python-ecology-lesson/main/episodes/data/speciesSubset.csv

Load in analysis packages and use them to read the data from the downloaded file into a Python data frame.   

In [None]:
import pandas as pd
# Note that pd.read_csv is used because we imported pandas as pd
surveys_df = pd.read_csv("data/surveys.csv")

print(surveys_df.head()) # The head() method returns the first five rows of the DataFrame by default.

print(type(surveys_df))

print(surveys_df.dtypes)

In [None]:
# Look at the column names
surveys_df.columns

In [None]:
# List all the unique species IDs
pd.unique(surveys_df['species_id'])

In [None]:
# Basic summary statistics of weight column
surveys_df['weight'].describe()

In [None]:
# using groupby and describe
# Group data by sex
grouped_data = surveys_df.groupby('sex')

# Summary statistics for all numeric columns by sex
print(grouped_data.describe())
# Provide the mean for each numeric column by sex
print(grouped_data.mean(numeric_only=True))

In [None]:
# Count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)
# Just the DO ones
print(surveys_df.groupby('species_id')['record_id'].count()['DO'])

In [None]:
# Multiply all weight values by 2
surveys_df['weight']*2

In [None]:
# Quick plot using matplotlib
# Make sure figures appear inline in the IPython notebook
%matplotlib inline
# Create a quick bar chart
species_counts.plot(kind='bar');

In [None]:
# Count how many animals were captured in each plot (site)
total_count = surveys_df.groupby('plot_id')['record_id'].nunique()
# Let's plot that too
total_count.plot(kind='bar');

In [None]:
# TIP: use the .head() method we saw earlier to make output shorter
# Method 1: select a 'subset' of the data using the column name
surveys_df['species_id']

# Method 2: use the column name as an 'attribute'; gives the same output
surveys_df.species_id

# Creates an object, surveys_species, that only contains the `species_id` column
surveys_species = surveys_df['species_id']

# Select the species and plot columns from the DataFrame
print(surveys_df[['species_id', 'plot_id']])

# What happens when you flip the order?
print(surveys_df[['plot_id', 'species_id']])

# What happens if you ask for a column that doesn't exist?
# print(surveys_df['speciess'])

In [None]:
# Select rows 0, 1, 2 (row 3 is not selected)
print(surveys_df[0:3])

# Select the first 5 rows (rows 0, 1, 2, 3, 4)
print(surveys_df[:5])

# Select the last row of the DataFrame
# (the slice starts at the last row, and ends at the end of the DataFrame)
print(surveys_df[-1:])

In [None]:
# Using the 'copy() method'
true_copy_surveys_df = surveys_df.copy()

# Using the '=' operator
ref_surveys_df = surveys_df

# Assign the value `0` to the first three rows of data in the DataFrame
ref_surveys_df[0:3] = 0

# ref_surveys_df was created using the '=' operator
print(ref_surveys_df.head())

# true_copy_surveys_df was created using the copy() method
print(true_copy_surveys_df.head())

# surveys_df is the original dataframe
print(surveys_df.head())
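
The reason the original DataFrame changes is that `=` does not copy anything: it only gives the same object a second name, whereas `copy()` builds an independent object. The sketch below shows this on a small made-up DataFrame (the values and the names `df`, `ref_df` and `copy_df` are assumptions for illustration, not part of the survey data).

```python
import pandas as pd

# Small made-up DataFrame standing in for surveys_df
df = pd.DataFrame({'record_id': [1, 2, 3], 'weight': [40.0, 48.0, 29.0]})

ref_df = df           # '=' gives a second name for the same object
copy_df = df.copy()   # copy() creates an independent DataFrame

print(ref_df is df)   # True: both names point at one object
print(copy_df is df)  # False: the copy is a separate object

# A change made through the reference shows up in the original...
ref_df.loc[0, 'weight'] = 0.0
print(df.loc[0, 'weight'])

# ...but the copy keeps the old value
print(copy_df.loc[0, 'weight'])
```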

In [None]:
surveys_df = pd.read_csv("data/surveys.csv")

# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]

# Select all columns for rows of index values 0 and 10
print(surveys_df.loc[[0, 10], :])

# What does this do?
print(surveys_df.loc[0, ['species_id', 'plot_id', 'weight']])

# What happens when you type the code below?
# print(surveys_df.loc[[0, 10, 35549], :])
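
The cell above mixes two indexers: `iloc` selects by integer position, while `loc` selects by index label and column name. With the default index the two look alike, so the sketch below uses a made-up DataFrame (hypothetical values and labels) whose index labels differ from the row positions.

```python
import pandas as pd

# Made-up DataFrame whose index labels differ from the row positions
df = pd.DataFrame({'species_id': ['NL', 'DM', 'PF'],
                   'weight': [40.0, 48.0, 29.0]},
                  index=[10, 20, 30])

# iloc is position-based: row at position 0, column at position 1
by_position = df.iloc[0, 1]

# loc is label-based: row labelled 10, column named 'weight'
by_label = df.loc[10, 'weight']
print(by_position, by_label)

# Slices differ too: iloc excludes the end position, loc includes the end label
df_default = df.reset_index(drop=True)   # index labels are now 0, 1, 2
n_iloc = len(df_default.iloc[0:2])       # positions 0 and 1
n_loc = len(df_default.loc[0:2])         # labels 0, 1 and 2
print(n_iloc, n_loc)

# Asking loc for a label that is not in the index raises a KeyError
try:
    df.loc[[10, 99], :]
    missing_label_ok = True
except KeyError:
    missing_label_ok = False
print(missing_label_ok)
```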

In [None]:
surveys_df.iloc[2, 6]

In [None]:
# list rows equal to 2002
print(surveys_df[surveys_df.year == 2002])

print("------")
# list rows not equal to 2002
print(surveys_df[surveys_df.year != 2002])

print("------")
print(surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)])
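
Each of the selections above works in two steps: the comparison produces a boolean Series (a "mask"), and indexing with that mask keeps only the rows marked True. A minimal sketch on made-up rows follows; the names `mask`, `outside` and `same_rows` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for surveys_df
df = pd.DataFrame({'year': [1979, 1980, 1983, 1985, 2002],
                   'species_id': ['NL', 'DM', 'DM', 'PF', 'NL']})

# A comparison produces a boolean Series, one True/False per row
mask = (df.year >= 1980) & (df.year <= 1985)
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
print(df[mask])

# '|' means "or" and '~' negates a mask;
# each condition needs its own parentheses
outside = df[(df.year < 1980) | (df.year > 1985)]
same_rows = df[~mask]
print(outside.equals(same_rows))
```

The parentheses matter because `&` and `|` bind more tightly than the comparison operators in Python.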

In [None]:
pd.isnull(surveys_df)

In [None]:
# To select just the rows with NaN values, we can use the 'any()' method
surveys_df[pd.isnull(surveys_df).any(axis=1)]

In [None]:
# Data types
print(type(surveys_df))
print('--')
print(surveys_df['sex'].dtype)
print('--')
print(surveys_df['record_id'].dtype)
print('--')
surveys_df.dtypes

In [None]:
# Convert the record_id field from an integer to a float
surveys_df['record_id'] = surveys_df['record_id'].astype('float64')
surveys_df['record_id'].dtype

In [None]:
print(surveys_df['weight'].mean())
# missing values
print(len(surveys_df[surveys_df['weight'].isna()]))
# How many rows have weight values?
print(len(surveys_df[surveys_df['weight'] > 0]))

df1 = surveys_df.copy()
# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)

print(df1['weight'].mean())

# Alternatively, fill all NaN values with the mean of the observed weights
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
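
How missing values are filled changes the summary statistics. The sketch below uses three made-up weights (not taken from the survey data) to show the arithmetic: `mean()` skips NaN, filling with 0 pulls the mean down, and filling with the observed mean leaves it unchanged.

```python
import numpy as np
import pandas as pd

# Three made-up weights, one of them missing
weights = pd.Series([10.0, np.nan, 30.0])

# mean() skips NaN by default: (10 + 30) / 2
observed_mean = weights.mean()

# Filling with 0 adds a real zero to the data: (10 + 0 + 30) / 3
zero_filled_mean = weights.fillna(0).mean()

# Filling with the observed mean keeps the mean: (10 + 20 + 30) / 3
mean_filled_mean = weights.fillna(observed_mean).mean()

print(observed_mean, zero_filled_mean, mean_filled_mean)
```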


In [None]:
# Drop all missing values
df_na = surveys_df.dropna()
# Write DataFrame to CSV
df_na.to_csv('data_output/surveys_complete.csv', index=False)

In [None]:
# Combining DataFrames with Pandas
import pandas as pd
surveys_df = pd.read_csv("data/surveys.csv",
                         keep_default_na=False, na_values=[""])
print(surveys_df)


In [None]:
species_df = pd.read_csv('data/species.csv', keep_default_na=False, na_values=[""])
species_df

In [None]:
# Read in first 10 lines of surveys table
survey_sub = surveys_df.head(10)
# Grab the last 10 rows
survey_sub_last10 = surveys_df.tail(10)
# Reset the index values so the second DataFrame appends properly
survey_sub_last10 = survey_sub_last10.reset_index(drop=True)
# drop=True option avoids adding new index column with old index values

# Stack the DataFrames on top of each other
vertical_stack = pd.concat([survey_sub, survey_sub_last10], axis=0)

# Place the DataFrames side by side
horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)

print(vertical_stack)
print("--")
print(horizontal_stack)
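
The effect of the `axis` argument is easiest to see in the shape and index of the result. The following sketch uses two tiny made-up DataFrames (`top` and `bottom` are hypothetical names) rather than the survey data.

```python
import pandas as pd

# Two tiny made-up DataFrames with the same columns
top = pd.DataFrame({'record_id': [1, 2], 'plot_id': [2, 3]})
bottom = pd.DataFrame({'record_id': [3, 4], 'plot_id': [2, 7]})

# axis=0 stacks rows: more rows, same columns, original index labels kept
stacked = pd.concat([top, bottom], axis=0)
print(stacked.shape)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows of the result from 0
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)
print(renumbered.index.tolist())

# axis=1 places the frames side by side, matching rows by index label
side_by_side = pd.concat([top, bottom], axis=1)
print(side_by_side.shape)
```

Note that stacking without `ignore_index=True` leaves duplicate index labels, which is one reason to reset or ignore the index when combining tables vertically.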

In [None]:
# Write DataFrame to CSV
vertical_stack.to_csv('data/out.csv', index=False)

# For kicks read our output back into Python and make sure all looks good
new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])
print(new_output)

In [None]:
# Read in first 10 lines of surveys table
survey_sub = surveys_df.head(10)

# Import a small subset of the species data designed for this part of the lesson.
# It is stored in the data folder.
species_sub = pd.read_csv('data/speciesSubset.csv', keep_default_na=False, na_values=[""])

print(species_sub.columns)

print(survey_sub.columns)

In [None]:
# Joins
merged_inner = pd.merge(left=survey_sub, right=species_sub, left_on='species_id', right_on='species_id')

print(merged_inner.shape)
print("--")
print(merged_inner)


In [None]:
print("-+-")

merged_left = pd.merge(left=survey_sub, right=species_sub, how='left', left_on='species_id', right_on='species_id')

print(merged_left)

print("-++-")
print(merged_left[merged_left['genus'].isna()])
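
The difference between the inner and left joins above comes down to what happens to rows with no match in the right-hand table. The sketch below shows it on small made-up tables; the `XX` species code is an invented value with no match, and `indicator=True` is an optional `pd.merge` argument that is not used in the cells above.

```python
import pandas as pd

# Made-up tables: species 'XX' has no entry in the species table
surveys = pd.DataFrame({'record_id': [1, 2, 3],
                        'species_id': ['NL', 'DM', 'XX']})
species = pd.DataFrame({'species_id': ['NL', 'DM'],
                        'genus': ['Neotoma', 'Dipodomys']})

# Inner join: only rows whose key appears in both tables survive
inner = pd.merge(surveys, species, how='inner', on='species_id')
print(inner.shape)

# Left join: every row of the left table survives; unmatched rows get NaN
# indicator=True adds a '_merge' column recording where each row came from
left = pd.merge(surveys, species, how='left', on='species_id', indicator=True)
print(left)

# Rows that found no match in the species table
unmatched = left[left['_merge'] == 'left_only']
print(unmatched['species_id'].tolist())
```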

In [None]:
# for loops
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
print(animals)
print("++++")

for creature in animals:
    print(creature)

print("+++++++++")
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
for creature in animals:
    pass
print('The loop variable is now: ' + creature)


In [None]:
# OS calls
import os

# exist_ok=True avoids an error if the directory already exists
os.makedirs('data/yearly_files', exist_ok=True)
os.listdir('data')

In [None]:
import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

# Select only data for the year 2002
surveys2002 = surveys_df[surveys_df.year == 2002]

# Write the new DataFrame to a CSV file
surveys2002.to_csv('data/yearly_files/surveys2002.csv')

surveys_df['year'].unique()

In [None]:
for year in surveys_df['year'].unique():
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    print(filename)

In [None]:
# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

for year in surveys_df['year'].unique():

    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year]

    # Write the new DataFrame to a CSV file
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)

In [None]:
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year -- year for which data is extracted
    all_data -- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = 'data/yearly_files/function_surveys' + str(this_year) + '.csv'
    surveys_year.to_csv(filename)



help(one_year_csv_writer)
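
Once defined, the function can replace the body of the earlier loop. The sketch below is a self-contained variant under stated assumptions: the `out_dir` argument, the returned filename, the temporary directory and the three-row DataFrame are all additions made for this illustration and are not part of the function defined above.

```python
import os
import tempfile

import pandas as pd


def one_year_csv_writer(this_year, all_data, out_dir):
    """Write a CSV file with the rows of all_data from this_year.

    out_dir and the return value are hypothetical additions for this sketch.
    """
    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a CSV file inside out_dir
    filename = os.path.join(out_dir,
                            'function_surveys' + str(this_year) + '.csv')
    surveys_year.to_csv(filename, index=False)
    return filename


# Made-up data standing in for surveys_df
all_data = pd.DataFrame({'year': [2001, 2001, 2002],
                         'weight': [40, 48, 29]})

# Write one file per year into a temporary directory
out_dir = tempfile.mkdtemp()
written = [one_year_csv_writer(year, all_data, out_dir)
           for year in all_data['year'].unique()]

print(sorted(os.listdir(out_dir)))
```

Passing the output directory as an argument, rather than hard-coding it, makes the function easier to reuse and to test.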

In [None]:
# if statements
a = 5

if a<0:  # Meets first condition?

    # if a IS less than zero
    print('a is a negative number')

elif a>0:  # Did not meet the first condition. Meets the second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

else:  # Met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')



In [None]:
# plots using plotnine
%matplotlib inline
import plotnine as p9

# get data without NAs
import pandas as pd

surveys_complete = pd.read_csv('data/surveys.csv')
surveys_complete = surveys_complete.dropna()

(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length'))
    + p9.geom_point()
)

In [None]:
# use a plot template
# Create
surveys_plot = p9.ggplot(data=surveys_complete,
                         mapping=p9.aes(x='weight', y='hindfoot_length'))

# Draw the plot
surveys_plot + p9.geom_point()

In [None]:
# use transparency
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length'))
    + p9.geom_point(alpha=0.1)
)

In [None]:
# use colour
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length'))
    + p9.geom_point(alpha=0.1, colour='blue')
)

In [None]:
# colour by species_id
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight',
                          y='hindfoot_length',
                          color='species_id'))
    + p9.geom_point(alpha=0.1)
)

In [None]:
# Change X label
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length', color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.xlab("Weight (g)")
)

In [None]:
# Choose log scale for X
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length', color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.xlab("Weight (g)")
    + p9.scale_x_log10()
)

In [None]:
# Changing theme elements
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length', color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.xlab("Weight (g)")
    + p9.scale_x_log10()
    + p9.theme_bw()
    + p9.theme(text=p9.element_text(size=16))
)

In [None]:
# Using boxplots to visualise distributions
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='species_id',
                          y='weight'))
    + p9.geom_boxplot()
)

In [None]:
# overlay boxplots with points to show population of each distribution
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='species_id',
                          y='weight'))
    + p9.geom_jitter(alpha=0.2)
    + p9.geom_boxplot(alpha=0.)
)

In [None]:
yearly_counts = surveys_complete.groupby(['year', 'species_id'])['species_id'].count()
print(yearly_counts)
print("-+_")

yearly_counts = yearly_counts.reset_index(name='counts')
print(yearly_counts)

In [None]:
(p9.ggplot(data=yearly_counts,
           mapping=p9.aes(x='year',
                          y='counts'))
    + p9.geom_line()
)


In [None]:
# take 2 - split by species
(p9.ggplot(data=yearly_counts,
           mapping=p9.aes(x='year',
                          y='counts',
                          color='species_id'))
    + p9.geom_line()
)

In [None]:
# Splitting using faceting
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight',
                          y='hindfoot_length',
                          color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.facet_wrap("plot_id")
)

In [None]:
# only select the years of interest
survey_2000 = surveys_complete[surveys_complete["year"].isin([2000, 2001])]

(p9.ggplot(data=survey_2000,
           mapping=p9.aes(x='weight',
                          y='hindfoot_length',
                          color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.facet_grid("year ~ sex")
)

In [None]:
# bar plots
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='factor(year)'))
    + p9.geom_bar()
)

In [None]:
# fix year labels
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='factor(year)'))
    + p9.geom_bar()
    + p9.theme_bw()
    + p9.theme(axis_text_x = p9.element_text(angle=90))
)

In [None]:
# storing a custom theme as object for later usage
my_custom_theme = p9.theme(axis_text_x = p9.element_text(color="grey", size=10,
                                                         angle=90, hjust=.5),
                           axis_text_y = p9.element_text(color="grey", size=10))

(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='factor(year)'))
    + p9.geom_bar()
    + my_custom_theme
)

In [None]:
# Saving a plot
my_plot = (p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length'))
    + p9.geom_point()
)
my_plot.save("scatterplot.png", width=10, height=10, dpi=300)