**00-short-introduction-to-Python.md** (4 changes: 2 additions & 2 deletions)

elements. Each element can be accessed by its index:

```python
>>> numbers = [1,2,3]
>>> numbers[0]
1
```

```python
ATuple = (1, 2, 3)
AList = [1, 2, 3]
```

### Challenge
1. What happens when you type `ATuple[2]=5` vs `AList[1]=5`?
2. Type `type(ATuple)` into Python - what is the object type?
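To see the difference for yourself, here is a short sketch (assuming `ATuple` and `AList` are defined as in this lesson):

```python
ATuple = (1, 2, 3)
AList = [1, 2, 3]

# Lists are mutable, so item assignment works
AList[1] = 5
print(AList)  # [1, 5, 3]

# Tuples are immutable, so item assignment raises a TypeError
try:
    ATuple[2] = 5
except TypeError as err:
    print("TypeError:", err)

print(type(ATuple))  # <class 'tuple'>
print(type(AList))   # <class 'list'>
```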


## Dictionaries
Expand Down
**01-starting-with-data.md** (70 changes: 35 additions & 35 deletions)

Each time we call a function that's in a library, we use the syntax
`LibraryName.FunctionName`. Adding the library name with a `.` before the
function name tells python where to find the function. In the example above, we
have imported pandas as `pd`. This means we don't have to type out `pandas` each
time we call a pandas function.


## Lesson Overview

We are studying the species and weight of animals caught in plots in a study
area. The data sets are stored in .csv (comma separated value) format. Within
the `.csv` files, each row holds information for a single animal, and the
columns represent: record_id, month, day, year, plot, species, sex, wgt.

The first few rows of our first file look like this:

and they can replicate the same analysis.

# Reading Data Using Pandas CSV
We will begin by locating and reading our survey data which are in CSV format.
We can use Pandas `read_csv` function to pull the file directly into a
[DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).

## So What's a DataFrame?

A DataFrame is a 2-dimensional data structure that can store data of different
types (including characters, integers, floating point values, factors and more)
in columns. It is similar to spreadsheets or SQL tables or the `data.frame` in
R.

First, let's make sure the python Pandas library is loaded. We will import
Pandas using the nickname `pd`.

```python
import pandas as pd
```

IPython Notebook, be sure to start the notebook in the workshop repository.
If you didn't do that you can always set the working directory using the code
below.

```python
import os
os.getcwd()
# if this directory isn't right, use the command below to set the working directory
os.chdir("YOURPathHere")
```

columns. It looks like the `read_csv` function in Pandas read our file
properly. However, we haven't saved any data to memory, so we can't yet work with it.
We need to assign the DataFrame to a variable. Remember that a variable is a
name for a value, such as `x`, or `data`. We can create a new object with a
variable name by assigning a value to it using `=`.

Let's call the imported survey data `surveys_df`:

```python
surveys_df = pd.read_csv("data/surveys.csv")
```
Now we can start manipulating our data! First, let's check the data type of the
object `surveys_df` using the `type` command. The `type` function and the
`__class__` attribute tell us that `surveys_df` is `<class
'pandas.core.frame.DataFrame'>` in Python.

```python
type(surveys_df)
surveys_df.__class__
```
We can also use the `surveys_df.dtypes` command to view the data type for each
column in our dataframe. `int64` represents numeric integer values - `int64`
cells cannot store decimals. `object` represents strings (letters and numbers).
`float64` represents numbers with decimals.

```python
surveys_df.dtypes
```

```
sex        object
wgt       float64
dtype: object
```
We'll talk a bit more about what the different formats mean in a different lesson.

### Useful Ways to View DataFrame objects in Python

There are multiple methods that can be used to summarize and access the data
stored in dataframes. Let's try out a few. Note that we call the method by using
the object name `surveys_df.method`. So `surveys_df.columns` provides an index
of all of the column names in our DataFrame.

## Challenges
Try out the methods below to see what they return.

1. `surveys_df.columns`
2. `surveys_df.head()`. Also, what does `surveys_df.head(15)` do?
3. `surveys_df.tail()`
4. `surveys_df.shape` - Take note of the output of the `shape` method. What format does it return the shape of the DataFrame in?

HINT: [More on tuples, here](https://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences).
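As the hint suggests, `shape` returns a tuple of `(number of rows, number of columns)`. A minimal sketch with a small made-up DataFrame (not the survey data):

```python
import pandas as pd

# a tiny stand-in DataFrame, just to inspect .shape
df = pd.DataFrame({'species': ['DO', 'DM', 'DO'],
                   'wgt': [42.0, 38.5, 44.1]})

print(df.shape)  # (3, 2) - a tuple of (rows, columns)

# because it is a tuple, it can be unpacked into two variables
nrows, ncols = df.shape
print(nrows, ncols)  # 3 2
```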



We often want to calculate summary statistics grouped by subsets or attributes
within fields of our data. For example, we might want to calculate the average
weight of all individuals per plot.

We can also extract basic statistics for all rows in a column, individually
using the syntax below:
```python
surveys_df['wgt'].count()
```

But if we want to summarize by one or more variables, for example sex, we can
use the `.groupby` method in Pandas. Once we've created a groupby DataFrame, we
can quickly calculate summary statistics by a group of our choice.

```python
# Group data by sex
sorted = surveys_df.groupby('sex')
# summary statistics for all numeric columns by sex
sorted.describe()
# provide the mean for each numeric column by sex
sorted.mean()
```

`sorted.mean()` **OUTPUT:**

```
     record_id        day         year       plot        wgt
sex
F  18036.412046  16.007138  1990.644997  11.440854  42.170555
M  17754.835601  16.184286  1990.480401  11.098282  42.995379
P  22488.000000  21.000000  1995.000000   8.000000  13.000000
```

is important to explore your data, before diving into analysis too quickly.
# Challenge

1. Have a look at the output of the `describe` method below. How many columns
represent the sex "Z", "P" or "R"?
2. What happens when you group by two columns using the syntax below and then grab
mean values:
   - `sorted2 = surveys_df.groupby(['plot','sex'])`
   - `sorted2.mean()`
3. Summarize weight values for each plot in your data. HINT: you can use the
following syntax to create summary statistics for just one column in your data,
where `byPlot = surveys_df.groupby('plot')`: `byPlot['wgt'].describe()`


Did you get #3 right? **A Snippet of the Output from challenge 3 looks like:**

```
plot
1  count    1903.000000
   mean       51.822911
   std        38.176670
```
## Quickly Creating Summary Counts in Pandas

Let's next create a list of unique species in our data. We can do this in a few
ways.

1. We can use the `nunique()` method, which returns a pandas Series of the unique
elements in the data and an associated count of how many rows contain each
element.

```python
species_list = surveys_df.record_id.groupby(surveys_df.species).nunique()
```
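A sketch of what `groupby` plus `nunique` produces, using a toy DataFrame rather than the survey file:

```python
import pandas as pd

# toy data standing in for the survey records
toy = pd.DataFrame({'record_id': [1, 2, 3, 4],
                    'species':   ['DO', 'DO', 'DM', 'DM']})

# count the unique record_ids within each species group
counts = toy.record_id.groupby(toy.species).nunique()
print(counts)
print(counts['DO'])  # 2
```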

Or, we can also count just the rows that have the species "DO":

```python
surveys_df.record_id.groupby(surveys_df['species']).nunique()['DO']
```

```python
total_count.plot(kind='bar');
```

1. Create a plot of average weight across all species per plot.
2. Create a plot of total males versus total females for the entire dataset.
3. Create a stacked bar plot that has male vs female for each plot.



being sex. Here is some data to help you solve this challenge:

```
            wgt
sex           F          M
plot
1     46.311138  55.950560
2     52.561845  51.391382
```

```python
my_plot = data.plot(kind='bar', stacked=True, title="Total Weight by Plot and Sex")
my_plot.set_xlabel("Plot")
my_plot.set_ylabel("Weight")
```


* You can use the command `unstack` to transform grouped data into columns for
plotting. Try running `surveys_df.unstack()` and see what it yields.
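A sketch of what `unstack` does to grouped data (toy numbers here, using the `plot`/`sex`/`wgt` column names from this lesson):

```python
import pandas as pd

toy = pd.DataFrame({'plot': [1, 1, 2, 2],
                    'sex':  ['F', 'M', 'F', 'M'],
                    'wgt':  [46.0, 56.0, 52.0, 51.0]})

# grouping by two columns yields a Series with a two-level index...
grouped = toy.groupby(['plot', 'sex'])['wgt'].mean()

# ...and unstack pivots the inner level ('sex') out into columns,
# which is the shape a stacked bar plot needs
table = grouped.unstack()
print(table)
```

Calling `table.plot(kind='bar', stacked=True)` on the result would then draw one bar per plot, split by sex.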
**08-working-with-sql.md** (4 changes: 2 additions & 2 deletions)

```python
con.close()
```

Storing your data in an SQLite database can provide substantial performance
improvements when reading/writing compared to CSV. The difference in performance
becomes more noticeable as the size of the dataset grows (see for example [these
benchmarks]).

[these benchmarks]: http://sebastianraschka.com/Articles/sqlite3_database.html#benchmarks
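As a minimal sketch of the SQLite round trip this section describes, using only the standard-library `sqlite3` module (the table and column names below are made up for illustration):

```python
import sqlite3

# an in-memory database; pass a filename such as "surveys.sqlite" to persist it
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE surveys (record_id INTEGER, species TEXT, wgt REAL)")
cur.executemany("INSERT INTO surveys VALUES (?, ?, ?)",
                [(1, 'DO', 42.0), (2, 'DM', 38.5)])
con.commit()

# pull back just the rows we need, without re-parsing a whole CSV file
cur.execute("SELECT species, wgt FROM surveys WHERE species = 'DO'")
print(cur.fetchall())  # [('DO', 42.0)]

con.close()
```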