# Introduction to Python

## Part II: Basics of Python Programming

This document is heavily based on the Python-Novice-Gapminder lesson developed by Software Carpentry, and the original lesson can be found online at http://swcarpentry.github.io/python-novice-gapminder/

In this notebook we will learn more about data structures. 

Data structures hold related data together. In other words, they are used to store a collection of related items.

There are four built-in data structures in Python - list, tuple, dictionary and set.

Each data structure supports its own set of operations for accessing and modifying the data it holds.

## Lists
Often in Python we'll have a collection of items, such as the values in one column of a dataset. We'll want one variable to represent the entire collection instead of a separate variable for each item.

We can use a list.

In [None]:
measurements = [0.273, 0.275, 0.277, 0.275, 0.276]
print(measurements)

As with strings, we can use the `len` function to get the length and slices to get parts of the list.

In [None]:
len(measurements)

In [None]:
measurements[0]

In [None]:
measurements[2:4]

Lists have many useful built-in methods.

In [None]:
dir(measurements)

Two useful methods are `append` and `remove`.

In [None]:
measurements.append(0.37)
measurements # we see 0.37 now

In [None]:
measurements.remove(0.275) # remove the first 0.275 item
measurements

In [None]:
help(measurements.append)

In [None]:
help(measurements.remove)

`remove` deletes an item when we know its value. What if we want to remove an item by its index instead?

In [None]:
del measurements[1] # remove the second element; 0.277 will be missing
measurements
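
To put the removal options side by side, here is a small sketch on a made-up list (`values` is not part of the lesson data). It also uses `pop`, a list method not covered above, which removes an item by index and hands it back.

```python
# Made-up values, separate from the `measurements` list above
values = [0.273, 0.275, 0.277, 0.275]

values.remove(0.275)    # by value: removes only the first 0.275
print(values)           # [0.273, 0.277, 0.275]

del values[0]           # by index: removes the item at position 0
print(values)           # [0.277, 0.275]

last = values.pop()     # by index (default is the last item) and returns it
print(last)             # 0.275
print(values)           # [0.277]
```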

Lists can store different types of items, even at the same time.

In [None]:
info = ['ASB10908', 25, 'Daniel', 13.2] # no problem mixing numbers and strings
info

In [None]:
info.append(measurements) # lists can even contain lists, although it can start getting complicated
info
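
As a rough sketch of how to reach into a nested list, the example below uses a made-up list called `record` rather than `info`. Indexing once gives the inner list; indexing twice gives an item inside it.

```python
# Hypothetical record: two plain items followed by a list of measurements
record = ['ASB10908', 25, [0.273, 0.275, 0.277]]

inner = record[2]        # the inner list counts as a single item of the outer list
print(inner)             # [0.273, 0.275, 0.277]
print(record[2][0])      # 0.273 (first item of the inner list)
print(len(record))       # 3
print(len(record[2]))    # 3
```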

We can change specific items in the list.

In [None]:
info[2] = 'Joel'
info

Lists can also be empty.

In [None]:
new_list = []
new_list

In [None]:
len(new_list)

Empty lists can be useful when we plan on adding things to them later using code.
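
A minimal sketch of that pattern, assuming made-up numbers: start from an empty list and append one item at a time. It uses a `for` loop, which this notebook has not introduced, so treat it as a preview rather than something you need to write yet.

```python
counts = [3, 5, 8]          # made-up values

doubled = []                # start with an empty list
for value in counts:
    doubled.append(value * 2)   # add one item on each pass through the loop

print(doubled)              # [6, 10, 16]
print(len(doubled))         # 3
```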

### Practice

Create a list containing the names of at least 3 people you know.
* Using that list, add a fourth person to the end.
* Using that list, capitalize the name of the second person (you may need to look back at the methods we used on strings).
* Use the `index` method for lists (look at help) to find the position of one of the names. Use `del` to then remove that person.

## Tuples

Tuples are part of the standard Python language. A tuple is very similar to a list. The main difference is that tuples are immutable: once a tuple is created, its items cannot be changed. Because of this, some tuple operations can be slightly faster than the equivalent list operations.

In [1]:
l = (1, 2, 3)
l[0]

1

In [2]:
l = 1, 2
l

(1, 2)

If you want to create a tuple with a single element, you must include a trailing comma:

In [3]:
singleton = (1, )

You can repeat a tuple by multiplying it by a number:

In [4]:
(1,) * 5

(1, 1, 1, 1, 1)

Note that you can concatenate tuples and use augmented assignment (`*=`, `+=`). Because tuples are immutable, this creates a new tuple rather than modifying the existing one:

In [5]:
s1 = (1,0)
s1 += (1,)
s1

(1, 0, 1)
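
Because tuples are immutable, `+=` does not change a tuple in place; it builds a new tuple and rebinds the name. Here is a minimal sketch with made-up names (`original`, `alias`). The `try`/`except` is only there to catch the error so the cell keeps running.

```python
original = (1, 0)
alias = original          # both names refer to the same tuple

original += (1,)          # builds a new tuple and rebinds `original`
print(original)           # (1, 0, 1)
print(alias)              # (1, 0): the old tuple is unchanged

try:
    alias[0] = 99         # item assignment is not allowed on tuples
except TypeError as err:
    print('TypeError:', err)
```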

### Tuple methods

In [8]:
# index, to find the position of the first occurrence of a value
# count, to count the number of occurrences of a value
l.count(2)

1
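
The cell above only calls `count`. As a short sketch on a made-up tuple, here are both methods together:

```python
t = (3, 1, 3, 2)    # made-up values

print(t.index(3))   # 0: position of the first 3
print(t.count(3))   # 2: the value 3 appears twice
print(t.index(2))   # 3: position of the value 2
```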

In [9]:
# unpacking: assign each item to its own variable
(x,y,z) = ('a','b','c')
print(x)

(x,y,z) = range(3)
print(x)

a
0


In [10]:
# length
t = (1, 2)
print(len(t))

# slicing (extracting a segment)
t = (1, 2, 3, 4, 5)
print(t[2:])

2
(3, 4, 5)

## Dictionaries

A dictionary is a collection of items, where each item is a pair made of a key and a value. Dictionaries are not sorted by key; since Python 3.7 they keep the order in which items were inserted. You can access the keys and the values independently.

In [11]:
d = {'first':'string value', 'second':[1,2]}
d.keys()

dict_keys(['first', 'second'])

In [12]:
d.values()

dict_values(['string value', [1, 2]])

In [13]:
d['first']


'string value'
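
Dictionaries can also be changed after they are created. The sketch below uses a made-up dictionary (`ages`) to show adding, overwriting, and removing pairs, plus the `get` method, which returns a fallback value instead of raising an error when a key is missing.

```python
ages = {'Daniel': 25, 'Joel': 31}   # hypothetical keys and values

ages['Maria'] = 28        # add a new key-value pair
ages['Daniel'] = 26       # overwrite the value stored under an existing key
del ages['Joel']          # remove a pair by its key

print('Maria' in ages)                # True: membership tests look at the keys
print(list(ages.keys()))              # ['Daniel', 'Maria']
print(ages.get('Joel', 'not found'))  # 'not found': fallback for a missing key
```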

## Sets

Sets are constructed from a sequence (or some other iterable object). Since sets cannot contain duplicates, they are usually used to build collections of unique items (e.g., a set of identifiers).

In [14]:
a = set([1, 2, 3, 4])
b = set([3, 4, 5, 6])
a | b # Union

{1, 2, 3, 4, 5, 6}

In [15]:
a & b # Intersection

{3, 4}

In [16]:
a < b # Proper subset (use <= for subset)

False

In [17]:
a - b # Difference

{1, 2}

In [18]:
a ^ b # Symmetric Difference


{1, 2, 5, 6}
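
As a small sketch of the "unique identifiers" use case mentioned above, the example below removes duplicates from a list of made-up sample IDs. Sets have no fixed order, so `sorted` is used to get a predictable listing.

```python
# Hypothetical identifiers with two repeats
sample_ids = ['ASB10908', 'ASB10911', 'ASB10908', 'ASB10915', 'ASB10911']

unique_ids = set(sample_ids)              # duplicates are dropped
print(len(sample_ids), len(unique_ids))   # 5 3
print(sorted(unique_ids))                 # ['ASB10908', 'ASB10911', 'ASB10915']
print('ASB10915' in unique_ids)           # True
```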

## Importing Data Using Pandas Library

One of the goals of this workshop is to introduce data analysis in Python. This requires being able to read data into Python. The `pandas` library provides the functionality to load data and work with it through a DataFrame object.

First we'll see how to load up the data. If you get an error please immediately ask for help.

In [None]:
import pandas
oceania_data = pandas.read_csv('data/gapminder_gdp_oceania.csv')

We're calling the `read_csv` function in the `pandas` library. Right now we're giving it one text argument, which is the path to the file we're reading in. If you go back to the file manager view of Jupyter you'll see a folder called `data`, which contains several files, one of which is `gapminder_gdp_oceania.csv`.

In [None]:
oceania_data

Jupyter is clever enough to recognize that this data is a table and display it in a table format. We see that the table starts with a column called `country`, followed by GDP measurements over many years. Each row of this dataset currently has a name, which is just `0` or `1`. In many cases we'd like the rows to be named by the `country` column instead. We can do that in the `read_csv` function by specifying the `index_col` parameter.

Note that this is the first time we've called a function or method with a *named* parameter. This is extremely useful when a function or method has dozens of parameters with default values and we want to change just one: we don't have to list all the other parameters in exactly the right order.
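
Named parameters are not specific to pandas. As a quick sketch using two built-in functions and made-up values, each call below changes one optional parameter by name and leaves the others at their defaults:

```python
values = [3, 1, 2]   # made-up values

ascending = sorted(values)                  # `reverse` keeps its default (False)
descending = sorted(values, reverse=True)   # change only `reverse`, by name
print(ascending)     # [1, 2, 3]
print(descending)    # [3, 2, 1]

decimal = int('101')           # `base` keeps its default (10)
binary = int('101', base=2)    # read the same text as a base-2 number
print(decimal, binary)         # 101 5
```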

In [None]:
oceania_data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
oceania_data

Now we see that the country column is used to set the row names. 
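
To see the effect of `index_col` without relying on the workshop files, here is a sketch that reads a tiny made-up table from a string instead of a file. The country names and numbers are invented, and `io.StringIO` is only used so that `read_csv` has something file-like to read.

```python
import io
import pandas

# Made-up CSV text standing in for a file on disk
csv_text = """country,gdpPercap_1952,gdpPercap_2007
Country A,100.0,200.0
Country B,150.0,450.0
"""

default_index = pandas.read_csv(io.StringIO(csv_text))
named_index = pandas.read_csv(io.StringIO(csv_text), index_col='country')

print(list(default_index.index))   # [0, 1]: rows are numbered
print(list(named_index.index))     # ['Country A', 'Country B']
print(list(named_index.columns))   # 'country' is no longer a regular column
```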

pandas DataFrames have a lot of attributes and methods available.

In [None]:
dir(oceania_data)

We're going to focus on just a few.

In [None]:
# tells us how many rows we have (2)
# and what type is stored in each column (float64, which is a decimal number)
oceania_data.info() 

In [None]:
# We can access specific columns by their name
oceania_data.gdpPercap_2007

In [None]:
# We can look at the transpose (flipped) version of the dataframe
oceania_data.T

In [None]:
# We can use the describe() method to get summary statistics for each column
oceania_data.describe()

In [None]:
# We can get the names of the columns as (essentially) a list
oceania_data.columns

In [None]:
# We can get the names of the rows as (essentially) a list
print(oceania_data.index)

print(len(oceania_data.index)) # get the number of rows

### Practice

Import the `gapminder_gdp_asia.csv` file with `index_col` correctly set. Run its `describe` and `info` methods.

How would you run `describe` such that you get summary information for each country instead of each column?

## Selecting Values in Pandas DataFrames

Remember with lists how we could use square brackets to select specific values? We can (mostly) do the same thing with DataFrames, but there are two ways to do it.

In [None]:
# Just loading a larger dataset first
europe_data = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

If we want to select values based on their integer positions, like with lists, we use `iloc`.

In [None]:
# Select the value in the second row and first column
europe_data.iloc[1, 0]

In [None]:
# Select the first 3 rows and first 5 columns
europe_data.iloc[0:3, 0:5]

In [None]:
# Select the first 3 rows and last 3 columns
europe_data.iloc[0:3, (len(europe_data.columns)-3):] # Note how only the left side of ':' is defined

In [None]:
# Select the first 3 rows and all the columns
europe_data.iloc[0:3, :]

If we want to select values based on their row and column names, then we use `loc`.

In [None]:
# Two gdp values for Germany at 1952 and 2007
europe_data.loc['Germany', ['gdpPercap_1952', 'gdpPercap_2007']]

In [None]:
# Two gdp values for Germany and United Kingdom at 1952 and 2007
europe_data.loc[['Germany', 'United Kingdom'], ['gdpPercap_1952', 'gdpPercap_2007']]

In [None]:
# We can also use ':' to select a range of labels with loc; unlike iloc, the end label is included
europe_data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
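
The difference between the two slicing rules is easy to miss, so here is a sketch on a tiny made-up DataFrame (`toy`, with invented labels and numbers). It also shows that negative positions work with `iloc`, which is a shorter way to ask for the last few columns.

```python
import pandas

# Made-up numbers, only to illustrate the selection rules
toy = pandas.DataFrame(
    {'gdp_1952': [1.0, 2.0, 3.0, 4.0],
     'gdp_1957': [1.5, 2.5, 3.5, 4.5],
     'gdp_1962': [2.0, 3.0, 4.0, 5.0]},
    index=['A', 'B', 'C', 'D'],
)

by_position = toy.iloc[0:2]     # positions 0 and 1; the end (2) is excluded
by_label = toy.loc['A':'C']     # labels A, B and C; the end label is included
print(list(by_position.index))  # ['A', 'B']
print(list(by_label.index))     # ['A', 'B', 'C']

last_two = toy.iloc[:, -2:]     # all rows, last two columns
print(list(last_two.columns))   # ['gdp_1957', 'gdp_1962']

# The same single value, selected by position and by name
print(toy.iloc[1, 1], toy.loc['B', 'gdp_1957'])   # 2.5 2.5
```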

### Practice

Using the `europe_data` dataframe:
* Find out the GDP for France in 2002.
* List all of Norway's GDP values.
* List the first 5 GDPs for the last 3 countries in the dataframe.

### Question

How would you display `gdpPercap_1952` and `gdpPercap_2007` for the first 5 rows of the data without finding out which column indexes those columns have? We can only use `iloc` for numeric indexes (which we want for rows) and `loc` for names (which we want for columns).

Answer: 

## Analyzing Data in Pandas DataFrames

Take a look at the `help` page for the `apply` method on DataFrames.

In [None]:
help(europe_data.apply)

There are a lot of parameters, but the two important ones are `func` and `axis`; this method lets us analyze all of the columns (or rows) of the dataframe at once. Each column (or row) is a [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) object from the pandas package. A Series behaves much like a list of numbers, but it also has methods for easily calculating things like the mean or sum, and arithmetic operations on it apply to every element. You can visit the link to see other methods.

Let's look at some examples.

First, let's get the average GDP for each year it was measured. I'm going to use something called a *lambda* function: we define a very simple function that takes in a Series called `x` and then does something with it. The `:` marks where the function's code starts. Here all we do is call the `mean` method on the Series.

`axis='index'` tells the `apply` method to compress rows ('index') together, forming a mean of each column.

In [None]:
europe_data.apply(lambda x: x.mean(), axis='index')

In [None]:
# Here's what happens if we compress columns together instead, forming a mean for each row (country)
europe_data.apply(lambda x: x.mean(), axis='columns')
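
The two `axis` settings are easier to check on a table small enough to verify by hand. The sketch below uses a made-up 2x2 DataFrame (`toy`) and also shows that a lambda is interchangeable with an ordinary named function (`series_mean`, a hypothetical name).

```python
import pandas

# Made-up numbers chosen so the means are easy to verify by hand
toy = pandas.DataFrame(
    {'gdp_1952': [1.0, 3.0], 'gdp_2007': [2.0, 6.0]},
    index=['A', 'B'],
)

def series_mean(x):
    """Same job as `lambda x: x.mean()`, written as a named function."""
    return x.mean()

per_column = toy.apply(lambda x: x.mean(), axis='index')    # one value per column
per_row = toy.apply(lambda x: x.mean(), axis='columns')     # one value per row
per_column_named = toy.apply(series_mean, axis='index')

print(per_column.to_dict())   # {'gdp_1952': 2.0, 'gdp_2007': 4.0}
print(per_row.to_dict())      # {'A': 1.5, 'B': 4.5}
print(per_column.equals(per_column_named))   # True
```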

The function doesn't have to return only one value per column / row. Next I'll calculate the average GDP for each year, and divide each country's GDP by that average so that the countries are more comparable.

Note how we take `x`, which is not a single number, and divide it by a single number. Division is defined for Series, so it is applied to every element, which is why it works here.

In [None]:
europe_relative_data = europe_data.apply(lambda x: x / x.mean(), axis='index')
europe_relative_data.head() # head() is a method to only display the first 5 rows; I use it so that screen isn't filled
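
To see the element-by-element division on its own, here is a sketch with a made-up Series (`gdp`). The same division on a plain Python list fails, which is caught with `try`/`except` so the cell keeps running.

```python
import pandas

gdp = pandas.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])  # made-up values

relative = gdp / gdp.mean()     # the mean is 20.0; every element is divided by it
print(relative.tolist())        # [0.5, 1.0, 1.5]

try:
    [10.0, 20.0, 30.0] / 20.0   # a plain list does not support division
except TypeError as err:
    print('TypeError:', err)
```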

### Practice

* Calculate the relative percent change in GDP for each country between every measurement. Note that there is a method on a `Series` (what `x` is) called `pct_change`. 
* The result you produced will have NaN as the first column; remove it.
* Save this new data under a variable name; you'll use it in a later practice.