# Introduction to Python

Welcome to UIC ACER's Introduction to Python! 
This course introduces you to Python by working through common tasks in data science: 
importing, manipulating, and exporting data.

This tutorial is adapted from the [Fred Hutch Introduction to Python](https://github.com/fredhutchio/python_intro/) and the [Data Carpentry Python for Ecologists](https://datacarpentry.org/python-ecology-lesson/) materials.

## Learning Objectives

By the end of this tutorial, you should be able to:

- work in a Jupyter notebook to run and record Python code
- understand basic Python syntax to use functions and assign variables
- create lists
- use for-loops
- define functions
- load packages and spreadsheet-style data using Python
- extract columns, rows, and portions thereof from datasets
- calculate summary statistics
- subset data by identifying rows that meet particular conditions
- group data to summarize by category
- accommodate missing data
- export data using pandas

In [None]:
# addition
3 + 2

In [None]:
# addition without spaces, same result
3+2

In [None]:
# subtraction
3 - 2

In [None]:
# multiplication
3 * 2

In [None]:
# division
3 / 2

In [None]:
# exponentiation
3 ** 2

In [None]:
# modulus (remainder)
3 % 2
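
The division and modulus cells above can be extended with floor division. The following is a minimal sketch using only built-in operators; the variable names are illustrative and not part of the original lesson.

```python
# true division always returns a float, even when the result is whole
quotient = 3 / 2         # 1.5
whole = 4 / 2            # 2.0

# floor division rounds the result down to a whole number
floor_quotient = 3 // 2  # 1

# divmod returns the floor quotient and the remainder together
pair = divmod(3, 2)      # (1, 1)

print(quotient, whole, floor_quotient, pair)
```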

In [None]:
# greater than
3 > 4 

In [None]:
# less than
3 < 4

In [None]:
# equal to
3 == 4

In [None]:
# less than or equal to
3 <= 4

In [None]:
# storing values in variables
pi = 3.1415

In [None]:
pi * 2

In [None]:
# storing multiple variables
weight_kg = 22
weight_lb = weight_kg * 2.2 # one kilogram is about 2.2 pounds
weight_lb

In [None]:
# updating weight_kg does not automatically update weight_lb
weight_kg = 38.5
weight_lb

In [None]:
# redefine weight_lb
weight_lb = weight_kg * 2.2
weight_lb

In [None]:
# round pi to a whole number
round(pi)

In [None]:
# pass a second argument for the number of digits to round to
round(pi, 1)

In [None]:
# use a named argument for the number of digits
round(pi, ndigits=1)
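
One detail of `round` that may be surprising: in Python 3, a value exactly halfway between two whole numbers rounds to the even one. A short sketch, reusing the `pi` value from above:

```python
pi = 3.1415

# with one argument, round returns an int
print(round(pi))      # 3
# with ndigits, round returns a float
print(round(pi, 2))   # 3.14

# values exactly halfway between two integers round to the even one
print(round(2.5))     # 2
print(round(3.5))     # 4
```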

In [None]:
# find help on a function
help(round)

In [None]:
# data types in python
integer = 42 # integer
real = 3.1415 # float
text = "UIC ACER" # string

In [None]:
text

In [None]:
# print output to screen
print(text)

In [None]:
# assign a list to a variable
numbers = [1, 2, 3, 4, 5, 6]
numbers

In [None]:
# access first element in list
numbers[0] 

In [None]:
# access last element in list
numbers[-1]

In [None]:
# access second-to-last element in list
numbers[-2]

In [None]:
# access a range
numbers[1:4]

In [None]:
# from the start up to but not including index 3
numbers[:3]

In [None]:
# from index 3 to the end
numbers[3:]

In [None]:
# the last three elements
numbers[-3:]

In [None]:
# add element (number) to end of list
numbers.append(7)

In [None]:
print(numbers)

In [None]:
# modify existing element
numbers[1] = 17
print(numbers)

In [None]:
# calculate length
len(numbers)

In [None]:
# lists of string data
organs = ["lung", "kidney", "heart"]

# for loop to access elements in list one at a time
for organ in organs:
    print(organ)

In [None]:
# define a chunk of code as function
def plus_ten(a):
    result = a + 10
    return result

In [None]:
# apply the function
z = plus_ten(21)
print(z)

In [None]:
for num in numbers:
    new_num = plus_ten(num)
    print(new_num)
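
The loop above prints each result but does not keep it. One possible way to store the results, sketched here with the same `plus_ten` function and the `numbers` list as it stands at this point in the notebook (the names `shifted` and `shifted_short` are illustrative):

```python
def plus_ten(a):
    return a + 10

numbers = [1, 17, 3, 4, 5, 6, 7]

# build the list step by step with a for loop
shifted = []
for num in numbers:
    shifted.append(plus_ten(num))

# the same result written as a list comprehension
shifted_short = [plus_ten(num) for num in numbers]

print(shifted)  # [11, 27, 13, 14, 15, 16, 17]
```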

In [None]:
# make packages available to use in this notebook
import os
import urllib.request
import pandas as pd

In [None]:
# create data directory (no error if it already exists)
os.makedirs("data", exist_ok=True)

In [None]:
# download dataset
urllib.request.urlretrieve("https://raw.githubusercontent.com/jsfalk/acer_tutorials/main/r_intro/data/animals.csv", "data/animals.csv")

In [None]:
# assign data to variable
animal_df = pd.read_csv("data/animals.csv")

In [None]:
# preview first few rows of the data
animal_df.head()

In [None]:
# print last eight rows of data to screen
animal_df.tail(8) # pass argument for number of rows

In [None]:
# number of rows
len(animal_df)

In [None]:
# number of columns
len(animal_df.columns)

In [None]:
# print summary
animal_df.info() 

In [None]:
# show only the first few rows of one column
animal_df["taxa"].head()

In [None]:
# show data type for a column
animal_df["taxa"].dtype 

In [None]:
# access columns by name using dot syntax
animal_df.taxa.head()

In [None]:
# select two columns at once
animal_df[["taxa", "year"]].head()

In [None]:
# access three rows 
animal_df[1:4]

In [None]:
# access the first three rows
animal_df[:3]

In [None]:
# access the last five rows
animal_df[34780:].tail()

In [None]:
# access the last row in the data frame
animal_df[-1:] 

In [None]:
# access one data element from a single cell
animal_df.iloc[2, 1]

In [None]:
# select range of data
animal_df.iloc[0:3, 1:4]

In [None]:
# empty stop boundary to indicate end of data
animal_df.iloc[:2, 3:]

In [None]:
# slicing using loc
animal_df.loc[1:4]

In [None]:
# empty stop boundary to indicate end of data
animal_df.loc[34781: ]

In [None]:
# Select all columns for rows of index values specified
animal_df.loc[[0, 10, 6831], ]

In [None]:
# select first row for specified columns
animal_df.loc[0, ["year", "weight", "genus"]]

In [None]:
# select rows labeled 0 through 5 for specified columns (loc includes the stop label, so this returns six rows)
animal_df.loc[0:5, ["year", "weight", "genus"]]
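
Slices with `iloc` and `loc` treat the stop boundary differently, which is easy to miss. The sketch below uses a small, made-up data frame (`toy_df` is hypothetical and not the animals dataset) to show the difference:

```python
import pandas as pd

# hypothetical toy data, not the animals dataset
toy_df = pd.DataFrame({
    "year": [1997, 1998, 1998, 1999, 2000, 2001],
    "weight": [40.0, 48.0, 29.0, 46.0, 36.0, 52.0],
})

# iloc slices by position and excludes the stop position
by_position = toy_df.iloc[0:3]
# loc slices by label and includes the stop label
by_label = toy_df.loc[0:3]

print(len(by_position))  # 3
print(len(by_label))     # 4
```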

In [None]:
# calculate basic stats for a single column
animal_df.weight.describe()

In [None]:
# calculate only the minimum for weight
animal_df.weight.min()

In [None]:
# convert weight column from grams to ounces
animal_df.weight.head() / 28.35

In [None]:
# convert maximum weight to ounces
animal_df.weight.max() / 28.35

In [None]:
# add converted column
animal_df['weight_oz'] = animal_df.weight / 28.35
animal_df.head()

In [None]:
# max of weight-converted-to-oz equivalent to max-of-weight converted to oz
animal_df.weight_oz.max()

In [None]:
# summary for string data
animal_df.genus.describe()

In [None]:
# test equality
animal_df.year == 1998

In [None]:
# conditionally subset all samples collected in 1998
animal_df[animal_df.year == 1998].head()

In [None]:
# conditionally subset all samples NOT collected in 1998 (~ inverts a boolean mask)
animal_df[~(animal_df.year == 1998)].head()

In [None]:
# shorter notation for not equal
animal_df[animal_df.year != 1998].head()

In [None]:
# extract all samples collected between 1998 and 2000
animal_df[(animal_df.year >= 1998) & (animal_df.year <= 2000)].head()

In [None]:
# extract all data for samples collected in 1998 or 1999
animal_df[(animal_df.year == 1998) | (animal_df.year == 1999)].head()
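
The range and either-or conditions above can also be written with the pandas methods `between` and `isin`. This is an alternative sketch on a made-up data frame (`toy_df` is hypothetical), not the approach used in the original lesson:

```python
import pandas as pd

# hypothetical toy data, not the animals dataset
toy_df = pd.DataFrame({"year": [1997, 1998, 1999, 2000, 2001]})

# between is inclusive on both ends by default
in_range = toy_df[toy_df.year.between(1998, 2000)]
# isin keeps rows whose value appears in the given list
in_list = toy_df[toy_df.year.isin([1998, 1999])]

print(in_range.year.tolist())  # [1998, 1999, 2000]
print(in_list.year.tolist())   # [1998, 1999]
```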

In [None]:
# identify unique elements in a column
pd.unique(animal_df["sex"])

In [None]:
# group data by sex 
grouped_data = animal_df.groupby("sex")

In [None]:
# summary stats for all numeric columns by sex
grouped_data.describe()

In [None]:
# summary stats by sex for only one column (weight)
grouped_data.weight.describe()

In [None]:
# show the number of non-missing values in each column for each sex
grouped_data.count()

In [None]:
# counts for only weight
grouped_data.weight.count()

In [None]:
# count the number of each sex for which hindfoot length is available
grouped_data.hindfoot_length.count()

In [None]:
# only display one sex (M), from hindfoot_length grouped by sex
grouped_data.hindfoot_length.count().M

In [None]:
# another way: only display one sex (M), from hindfoot_length grouped by sex
animal_df.groupby("sex")["hindfoot_length"].count()["M"]

In [None]:
# save output to object for later use
sex_counts = grouped_data.hindfoot_length.count()
print(sex_counts)
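
Note that `count` on grouped data skips missing values, so it reports how many measurements exist rather than how many rows are in each group. A minimal sketch on made-up data (`toy_df` and its values are hypothetical) contrasts it with `size`:

```python
import pandas as pd

# hypothetical toy data with two missing measurements
toy_df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "hindfoot_length": [32.0, float("nan"), 35.0, 36.0, float("nan")],
})
grouped = toy_df.groupby("sex")

# count ignores missing values in the selected column
counts = grouped.hindfoot_length.count()
# size counts every row in the group, missing or not
sizes = grouped.size()

print(counts["F"], counts["M"])  # 1 2
print(sizes["F"], sizes["M"])    # 2 3
```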

In [None]:
# test if value is missing
pd.isnull(animal_df).head()

In [None]:
# count the rows WITHOUT any missing data
len(animal_df.dropna())

In [None]:
# length of original data
len(animal_df)

In [None]:
# create new copy of data frame
animal_copy = animal_df.copy()

In [None]:
# replace missing values with mean
mean_weight = animal_df.weight.mean()
animal_copy.weight = animal_copy.weight.fillna(mean_weight)
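
To see what filling with the mean does, here is a small sketch on a made-up series (`toy_weights` is hypothetical). The mean is computed from the non-missing values only, so filling the gaps with it leaves the mean unchanged:

```python
import pandas as pd

# hypothetical weights with two missing values
toy_weights = pd.Series([10.0, None, 20.0, None])

# mean skips missing values by default: (10 + 20) / 2
mean_weight = toy_weights.mean()
filled = toy_weights.fillna(mean_weight)

print(mean_weight)      # 15.0
print(filled.tolist())  # [10.0, 15.0, 20.0, 15.0]
print(filled.mean())    # 15.0
```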

In [None]:
# exclude missing data in only weight
weight_complete = animal_df.dropna(subset = ["weight"])

In [None]:
# save filtered data to file
weight_complete.to_csv("data/weight_complete.csv", index=False)

In [None]:
# drop rows with missing values in sex, hindfoot_length, or weight
animals_reduced = animal_df.dropna(subset = ["sex", "hindfoot_length", "weight"])

In [None]:
# show categories
pd.unique(animals_reduced.sex)

In [None]:
# remove missing values that aren't NaN
animals_reduced = animals_reduced[animals_reduced.sex != "not reported"]

In [None]:
pd.unique(animals_reduced.sex)

In [None]:
# count number of samples for each species
species_counts = animals_reduced.groupby("species").species.count()
species_counts

In [None]:
# reset index to default
species_counts = species_counts.reset_index(name="counts")
species_counts

In [None]:
# keep only species with many observations
frequent_species = species_counts[species_counts.counts > 500]
frequent_species

In [None]:
# extract values for frequently occurring species
animals_reduced = animals_reduced[animals_reduced["species"].isin(frequent_species.species)]
animals_reduced.head()

In [None]:
# write data to csv
animals_reduced.to_csv("data/animals_reduced.csv", index=False)