
Merge pull request #78 from seralf/gh-pages
added first basic conversions rst->md for /recipes etc
rufuspollock committed May 29, 2015
2 parents 7c727ad + 6e781f0 commit 3eb3e2c
Showing 47 changed files with 4,069 additions and 0 deletions.
174 changes: 174 additions & 0 deletions text/appendix/glossary.md
@@ -0,0 +1,174 @@

Glossary
========

Open Data
---------

Open data is data that can be used, reused and redistributed freely by anyone for any purpose. More details can be found at [opendefinition.org](http://www.opendefinition.org/).

Machine-readable
----------------
Machine-readable formats are those from which computer programs can easily extract the data. PDF documents are not machine-readable: computers can display the text nicely, but have great difficulty extracting the context that surrounds it. Common machine-readable file formats are `CSV` and Excel files.

Readme
------
A file (usually named `README` or `README.txt`) that tells new users what the current directory or set of files is about. This is very common in open source software projects, and it is considered good practice to include one with various publications (including datasets).
The file usually contains a short description of what to expect.

BitTorrent
----------
BitTorrent is a protocol for distributing the bandwidth of transferring very large files among the computers participating in the transfer. Rather than downloading a file from a single source, BitTorrent allows peers to download from each other.

JSON
----
JavaScript Object Notation. A common format to exchange data. Although it is derived from Javascript, libraries to parse JSON data exist for many programming languages. Its compact style and ease of use have made it widespread. To make viewing JSON in a browser easier you can install a plugin such as [JSONView in Chrome](https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc)
and [JSONView in Firefox](https://addons.mozilla.org/en-us/firefox/addon/jsonview/).
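
A small example of a JSON document (the names and values are purely illustrative):

```json
{"name": "Alice", "age": 34, "tags": ["editor", "admin"]}
```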

GDP
---
Gross domestic product (GDP) is the market value of all officially recognized goods and services produced within a country in a given period of time. GDP per capita is often considered an indicator of a country's standard of living. (Source: Wikipedia.)

GeoJSON
-------
GeoJSON is a format for encoding a variety of geographic data structures. It is based on the `JSON` specification. More documentation can be found at [http://www.geojson.org](http://www.geojson.org).
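
For example, a single point is encoded like this (per the GeoJSON specification):

```json
{"type": "Point", "coordinates": [125.6, 10.1]}
```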

Geocoding
---------
Short for geographical coding: the practice of attaching geographical coordinates to items.

Geocode
-------
See `Geocoding`.

CSV
---
Comma Separated Values. A very simple, open format for tabular data which can be exported and imported by all spreadsheet applications and is easily manipulable with command line tools.
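
For example, a tiny CSV file with a header row (the contents are illustrative):

```
name,age
Alice,34
Bob,29
```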

Comma-separated Values
----------------------
See `CSV`

curl
-----
[http://curl.haxx.se/](http://curl.haxx.se/) - a command line tool for transferring data to and from online systems over standard internet protocols including FTP and HTTP. Very powerful and great for working with `Web API`s from the command line.

DAP
---
See `Data Access Protocol`.

Data Access Protocol
--------------------
A system that allows outside users to be granted access to a database over the network without overloading either the database or the requesting system.

etherpad
--------
A piece of software for collaborative real-time editing of text. See [http://etherpad.org/](http://etherpad.org/).

Attribution Licence
-------------------
A licence that requires attributing the original source of the licensed material.

API
---
See `Application Programming Interface`.

Application Programming Interface
---------------------------------
A way computer programmes talk to one another. It can be understood in terms of how a programmer sends instructions between programmes.

Web API
-------
An `API` that is designed to work over the Internet.

Share-alike Licence
-------------------
A licence that requires users of a work to provide the content under the same or similar conditions as the original.

Public domain
-------------
No copyright exists over the work. Note that the concept of the public domain does not exist in all jurisdictions.

Open standards
--------------
Generally understood as technical standards which are free from licensing restrictions. Can also be interpreted to mean standards which are developed in a vendor-neutral manner.

Anonymisation
-------------
The process of treating data such that it cannot be used for the identification of individuals.

IP rights
---------
See `Intellectual property rights`.

Intellectual property rights
----------------------------
Monopolies granted to individuals for intellectual creations.

Tab-separated values
--------------------
Tab-separated values (TSV) is a very common text file format for sharing tabular data. The format is extremely simple and highly `machine-readable`.

Taxonomy
--------
Classification. Taxonomy refers to the hierarchical classification of things. One of the best known is the Linnaean classification of species, which is still used today to classify all living beings.

Qualitative Data
----------------
Qualitative data is data telling you something about qualities, e.g. descriptions, colors, etc. Interviews count as qualitative data.

Quantitative Data
-----------------
Quantitative data tells you something about a measure or quantity, such as how many things you have, or their size (if measured).

Crowdsourcing
-------------
A mashup of *crowd* and *outsourcing*: having a lot of people each do simple tasks in order to complete the whole piece of work.

Choropleth Map
--------------
A choropleth map is a map where values are encoded onto regions using a color mapping: each region is colored according to its underlying value.

Mean
----
The arithmetic mean of a set of values. Calculated by summing up all values and then dividing by the number of values.
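
In symbols, for values $x_1, \ldots, x_n$:

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}
```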

Normal Distribution
-------------------
The normal (or Gaussian) distribution is a continuous probability distribution with a bell-shaped curve.

Median
------
The median is the value below which 50% of the values in a range fall (and above which the other 50% fall).

Quartiles
---------
Quartiles are the values below which 25%, 50% and 75% of the values in a range fall.

Percentiles
-----------
The nth percentile is the value below which n% of the values in a given range fall, e.g. the 5th percentile: 5 percent of values are lower than this value.

Scraping
--------
The process of extracting data into `machine-readable` formats from non-pure data sources, e.g. webpages or PDF documents. Often prefixed with the source (web-scraping, PDF-scraping).

Categorical Data
----------------
Data that helps put things into categories, e.g. country names, groups, conditions, tags.

Discrete Data
--------------
Numerical data that, if you plot all possible values, has gaps in it, e.g. counts of things (there are no 1.5 children). Compare to `Continuous Data`.

Continuous Data
---------------
Numerical data that, if you plot all possible values, has no gaps, e.g. sizes (you can be 155.55cm or 155.56cm tall, etc.). Compare to `Discrete Data`.

Boolean logic
-------------
A form of algebra in which all values are reduced to either `TRUE` or `FALSE`.
38 changes: 38 additions & 0 deletions text/archiving-twitter.md
@@ -0,0 +1,38 @@

Archiving Twitter
=================

Twitter data is only available via the search API for up to 7 days. Data for a given account only goes back a few thousand tweets. Thus archiving tweets can be a useful activity. This entry details a few options and in the process shows some neat tips and tricks for pulling down data.

Using Google Reader
-------------------

Twitter still gives out Atom/RSS feeds (though they are increasingly hidden!).

You can thus use Google Reader, or any other feed reader with auto-archiving capabilities, to archive your twitter feed.

### Constructing the query


You need to escape non-ASCII characters in the query, e.g. here's a query for `#okfn`:

http://search.twitter.com/search.atom?q=%23okfn

Here's one for tweets mentioning `@okfn`:

http://search.twitter.com/search.atom?q=%40okfn

Here are tweets from `@okfn` (query is `from:okfn`):

http://search.twitter.com/search.atom?q=from%3Aokfn
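
If you also want an offline copy of the raw feed, here's a minimal Python sketch (assuming the search endpoint above is still live; the output filename is arbitrary):

```python
import urllib.request

# Fetch the Atom feed for the #okfn search and save it to disk.
url = "http://search.twitter.com/search.atom?q=%23okfn"
with urllib.request.urlopen(url) as response:
    feed = response.read()

with open("okfn-tweets.atom", "wb") as f:
    f.write(feed)
```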

### Add it to your Google Reader account

Sign in (or sign up) and then subscribe to the link constructed above. Archiving will now happen automatically.


Using Javascript and the DataHub
--------------------------------

See [https://github.com/OKFN-BR/BusaoSP/blob/master/getdata.js](https://github.com/OKFN-BR/BusaoSP/blob/master/getdata.js)

102 changes: 102 additions & 0 deletions text/csv.md
@@ -0,0 +1,102 @@
CSV: Lingua Data
================

People love CSV (Comma-Separated Values) for its simplicity: it stores tables in plain text files, one row per line, with the first row defining the column names. In many ways this is the lingua franca of data. Things become a bit messy, however, when you realize that very little of this description is ever true in practice: rows can extend across several lines, file headers are often missing or preceded by random headings, and the flexible format even invites producers to vary the number of cells per row.

Things that are not true about CSV:

* Every row is one line, every line is a row.
* The first row contains the column headings.
* All rows have the same number of columns.
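
Python's `csv` module copes with the first of these pitfalls, where naive line-splitting fails. A minimal sketch, assuming a file named `data.csv`:

```python
import csv

with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        # A quoted field may contain embedded newlines; csv.reader
        # still returns the record as a single row.
        print(len(row), row)
```

Printing `len(row)` per row is also a quick check for the third pitfall (a varying number of cells per row).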

Before you process CSV files
----------------------------

It is advisable to deal with CSV encoding and quoting issues early in your workflow.

If there's a chance that your CSV file contains non-English words, or English proper names such as surnames or placenames, then you should verify that the data is in the character set encoding that you expect, e.g., `UTF-8` or `ISO-8859-1`. Otherwise, convert it to the encoding you work in, using iconv.
GNU iconv is limited to converting files which will fit in the RAM available on your machine and which contain data in a single character set encoding.
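If iconv's limitations bite, a streaming re-encode is easy to sketch in Python (assuming the source really is ISO-8859-1; the filenames are illustrative):

```python
# Re-encode a large CSV file line by line, so the whole file
# never has to fit in RAM at once.
with open("data-latin1.csv", encoding="iso-8859-1") as src, \
     open("data-utf8.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
```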

There are multiple conventions for quoting the markers which delimit fields and lines in CSV files. The tool which generated your CSV files may have done so in a way that makes them unreadable by other computer programmes.
In particular, a naive CSV implementation may have left backslashes or double quotes near the edges of fields in a way that Excel will ignore but which is unacceptable to stricter systems such as databases. Try to identify these issues early; they may be trivially fixable with basic
UNIX tools such as `tr` and `sed`.
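
One quick way to see which conventions a file actually uses is Python's `csv.Sniffer` (a sketch only: the sniffer is heuristic and can guess wrong on unusual files):

```python
import csv

with open("data.csv", newline="") as f:
    # Guess the dialect from the first 64 KB of the file.
    dialect = csv.Sniffer().sniff(f.read(64 * 1024))

print(dialect.delimiter, dialect.quotechar, repr(dialect.lineterminator))
```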

CSV options
-----------

The markers for lines and fields differ between CSV files. There are four of them: line terminators, field separators, field quotes, and escape markers.

CSV files comprise a set of lines. Each line is followed by a termination marker, including the final line. Within each line there are fields.

* field separator
* field quoting (delimiter character and quoting policy)
* escape marker
* line terminator (at the end of every line)
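
Python's `csv` module exposes all four as writer options; a minimal sketch (the values shown are illustrative, not recommendations):

```python
import csv

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(
        f,
        delimiter=",",              # field separator
        quotechar='"',              # field quote delimiter
        quoting=csv.QUOTE_MINIMAL,  # field quoting policy
        lineterminator="\r\n",      # line terminator
        escapechar="\\",            # escape marker
    )
    writer.writerow(["id", "note"])
    writer.writerow([1, 'contains "quotes", and commas'])
```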


Exporting from a relational database to CSV
-------------------------------------------
* MySQL (server-side, i.e. the file is written on the database host):
```SQL
SELECT * FROM table_name INTO OUTFILE '/tmp/table_name.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
```
* postgres:
```SQL
COPY table_name TO STDOUT WITH (FORMAT csv, HEADER);
```
* sqlite3 (dot-commands in the `sqlite3` shell, not SQL):
```
.mode csv
.output table_name.csv
SELECT * FROM table_name;
```
* DB2 (tab-separated, via the `db2` command line processor):
```SQL
EXPORT TO /tmp/table_name.csv OF DEL MODIFIED BY COLDEL0x09 SELECT * FROM schema.table;
```
* SQL Server (via the `bcp` command-line utility; names are illustrative):
```
bcp db_name.dbo.table_name out table_name.csv -c -t, -S server_name -T
```

Import CSV to a relational database
-----------------------------------

* MySQL:
```SQL
LOAD DATA LOCAL INFILE '/path/to/file.csv' INTO TABLE table_name FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES; SHOW WARNINGS;
```
* postgres (server-side; use `\copy` in psql for a client-side file):
```SQL
COPY table_name FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER);
```
* sqlite3 (dot-commands in the `sqlite3` shell, not SQL):
```
.mode csv
.import /path/to/file.csv table_name
```
* python (via a DB-API cursor):
```python
cursor.executemany("INSERT INTO table_name VALUES (?, ?)", rows)
```
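
Putting the Python option together with the `csv` module, a minimal self-contained sketch (the table, column names and filename are made up for illustration):

```python
import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")

with open("people.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany("INSERT INTO people VALUES (?, ?)", reader)

conn.commit()
conn.close()
```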

Exporting CSV from spreadsheets
-------------------------------
* Excel gotchas
* Refine gotchas
* Gnumeric gotchas


Programming
-----------
* python csv module
* awk
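
For instance, reading rows as dictionaries keyed by the header row with Python's `csv` module (the column names are illustrative):

```python
import csv

with open("people.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["name"], row["age"])
```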


Folding nested values into CSV
------------------------------

About halfway through producing a CSV export, you usually realize that the data does not neatly lend itself to being serialized into a single table: a cell value can really only be expressed as a mapping, or it can have several values at once. At this stage you have several options:

* Quote CSV in CSV. Yo dawg, don't do that, please.
* Generate a proper relational database dump or SQLite image. Not bad, but not a well-defined and generally compatible data exchange format either.
* Export multiple CSV sheets that combine to a relational model. This is probably the cleanest solution but requires that you also specify how the sheets relate to each other, e.g. by releasing an SQL schema (or at least a list of foreign keys).
* Generate the export in a more expressive format, such as JSON. This is not such a bad idea, as it will leave rows as list items while giving you the ability to have lists and mappings as values.
* Have magic column names that allow folding and unfolding of nested structures. One nice example of this is formencode's [`NestedVariables`](http://formencode.org/Validator.html#http-html-form-input), which can also be used without the remainder of the library (a hand-rolled version of the idea is sketched below).
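
A hand-rolled illustration of the magic-column-names idea (a sketch only, not formencode's actual implementation):

```python
def flatten(record, prefix=""):
    """Fold a nested dict into a flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

# {"id": 1, "address": {"city": "Berlin", "zip": "10115"}}
# becomes {"id": 1, "address.city": "Berlin", "address.zip": "10115"}
```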

52 changes: 52 additions & 0 deletions text/data-wrangling-intro.md
@@ -0,0 +1,52 @@

An Introduction to Data Wrangling
=================================

An introduction to data wrangling covers obtaining, cleaning and using data. It's organized as a series of simple tasks that you work through.

Task 1: Finding a Question
--------------------------

Data isn't an end in itself. We usually want data to help us answer some question or help us do some activity.

So your first task will be to think of some question that you'd like to answer that could be answered by getting hold of some data.

Examples:

* How many people are without clean drinking water in the world this year?
* How much did my country spend on defence last year?
* What is a normal weight for someone my age?

You can find more questions, and requests for specific datasets, at [http://getthedata.org/](http://getthedata.org/)

For the purposes of exposition, during the rest of this course we are going to focus on a specific example question shown below. Of course, you should focus on your own question.

**Our question: How much financial support did the US government give to banks and other companies during the financial crisis of 2008-2009?**

Task 2: Finding Data to Answer your Question
--------------------------------------------

Now it's time to go and find data to answer your question.

We'll focus on our own example question but you can use the same techniques for your own question.

**Our question: How much financial support did the US government give to banks and other companies during the financial crisis of 2008-2009?**

In locating relevant data you have two options:

* Use a standard web search engine like Google, Yahoo, etc.
* Go directly to a specific service that allows you to search data relevant to your topic. For example, if you know what you are looking for is likely to be in government statistics, you can go directly to your official statistics agency's site. Alternatively, you can go to a dedicated data hub like [http://thedatahub.org/](http://thedatahub.org/). We usually recommend option 1, because a search on a generic engine will usually also find datasets held in more specific systems.

So in our case, we'll begin with a Google search:
```
"US government bailout data"
```

Using Google or another search engine to find data is something of an art, but the general approach is to start with a basic search and follow links in the results, either directly to data or to refine your search.

So in our case it becomes apparent that the US government have released official data. One link obtained is: [http://finacialstability.org](http://finacialstability.org)

Which as of June 2011 redirects to: [http://www.treasury.gov/initiatives/financial-stability](http://www.treasury.gov/initiatives/financial-stability)

On the left-hand sidebar there is an "Investment Programs" option which gives a list of investments. We have hit the motherlode. However, there are also some summary sites that seem to have consolidated versions of the data. For example, via [http://getthedata.org/questions/218/size-and-terms-of-us-government-support-for-insurer-aig](http://getthedata.org/questions/218/size-and-terms-of-us-government-support-for-insurer-aig) we have: [http://subsidyscope.org/bailout/tarp/](http://subsidyscope.org/bailout/tarp/)
