Django application for accruing reference data that you can use across applications.
install pip
(sudo) easy_install pip
install beautiful soup 4
(sudo) pip install beautifulsoup4
install django
(sudo) pip install django
install South (data migration tool), if it isn't already installed.
(sudo) pip install South
install mechanize, a library that implements a basic browser client in python.
(sudo) pip install mechanize
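Once the packages above are installed, a short script can confirm that they import. This is a minimal sketch and not part of django_reference_data: the helper name check_imports is made up for illustration, and the module names assume the usual import names for these packages (bs4 for beautifulsoup4, south for South).

```python
import importlib


def check_imports(module_names):
    """Return a dict mapping each module name to True if it imports, else False."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except Exception:
            # A broken install can raise something other than ImportError.
            results[name] = False
    return results


if __name__ == "__main__":
    # Import names assumed for the packages installed above.
    wanted = ["bs4", "django", "south", "mechanize"]
    for name, ok in sorted(check_imports(wanted).items()):
        print("%-10s %s" % (name, "ok" if ok else "MISSING"))
```

Anything reported as MISSING needs its pip install step run again.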
in your work directory, create a django site:
django-admin.py startproject <site_directory>
cd into the site_directory
cd <site_directory>
pull in python_utilities
git clone
pull in the python django_reference_data code
git clone
from the site_directory, cd into the site configuration directory, where settings.py is located (it has the same name as site_directory, but is nested inside site_directory, alongside the other django code you pulled in from git: <site_directory>/<same_name_as_site_directory>).
cd <same_name_as_site_directory>
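If you are not sure you ended up in the right place, the layout described above can be checked with a few lines of Python. This is only a sketch: the function name find_settings_dir is hypothetical, and it assumes the default layout startproject creates (manage.py in the site directory, settings.py in the nested directory of the same name).

```python
import os


def find_settings_dir(site_directory):
    """Return the nested configuration directory of a Django site, or None."""
    site_directory = os.path.abspath(site_directory)
    site_name = os.path.basename(site_directory)
    config_dir = os.path.join(site_directory, site_name)
    # Both files must exist for the layout to match what startproject creates.
    has_manage = os.path.isfile(os.path.join(site_directory, "manage.py"))
    has_settings = os.path.isfile(os.path.join(config_dir, "settings.py"))
    if has_manage and has_settings:
        return config_dir
    return None
```

Calling find_settings_dir( "<site_directory>" ) returns the directory to cd into, or None if either file is missing.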
in settings.py, set USE_TZ to False to turn off time zone support:
USE_TZ = False
configure the database in settings.py:
I recommend using PostgreSQL. That said, I only recently arrived at this recommendation and haven't really followed it myself, so there is no detailed doc for it just yet (its unicode support is supposedly a gazillion times better than mysql's).
For mysql:
create mysql database.
at the least, make your database use character set utf8 and collation utf8_unicode_ci
To support emoji and crazy characters, in mysql >= 5.5.2, you can try setting encoding to utf8mb4 and collation to utf8mb4_unicode_ci instead of utf8 and utf8_unicode_ci. It didn't work for me, but I converted the database instead of starting with it like that from scratch, so your mileage may vary. If you need to do this to an existing database:
ALTER DATABASE <database_name> CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
create user to interact with mysql database. Set permissions so user has all permissions to your database.
In settings.py, in the DATABASES structure:
- set the ENGINE to "django.db.backends.mysql"
- set the database NAME, USER, and PASSWORD.
- If the database is not on localhost, enter a HOST.
- If the database is listening on a non-standard port, enter a PORT.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql', # Add 'postgresql_psycopg2', 'mysql', 'sqlite3' or 'oracle'.
        'NAME': '<db_name>', # Or path to database file if using sqlite3.
        # The following settings are not used with sqlite3:
        'USER': '<db_username>',
        'PASSWORD': '<db_password>',
        'HOST': '', # Empty for localhost through domain sockets or '127.0.0.1' for localhost through TCP.
        'PORT': '', # Set to empty string for default.
    }
}
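If you do try utf8mb4, the connection character set may need to match the database. The sketch below builds the same mysql DATABASES structure with an OPTIONS entry, which Django passes through to the MySQL driver. The helper name mysql_database_settings is hypothetical, and whether the charset option is needed in your setup is an assumption to test, not something this project has verified.

```python
def mysql_database_settings(name, user, password, host="", port="", charset="utf8"):
    """Build a DATABASES dict for Django's MySQL backend."""
    return {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": name,
            "USER": user,
            "PASSWORD": password,
            "HOST": host,  # empty string means localhost
            "PORT": port,  # empty string means the default port
            # Passed through to the MySQL driver when it connects.
            "OPTIONS": {"charset": charset},
        }
    }


# In settings.py; the placeholder values are the same ones used above.
DATABASES = mysql_database_settings("<db_name>", "<db_username>", "<db_password>")
```

Pass charset="utf8mb4" if you created or converted the database with that character set.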
For sqlite3:
figure out what file you want to hold the database. For the initial implementation, we used reddit.sqlite in the same directory as the code (/home/socs/socs_reddit/reddit_collect/reddit.sqlite).
In settings.py, in the DATABASES structure:
- set the ENGINE to "django.db.backends.sqlite3"
- set the database NAME to the path to the database file.
- USER, PASSWORD, HOST, and PORT are not used with sqlite3; leave them empty.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3', # Add 'postgresql_psycopg2', 'mysql', 'sqlite3' or 'oracle'.
        'NAME': '/home/socs/socs_reddit/reddit_collect/reddit.sqlite', # Or path to database file if using sqlite3.
        # The following settings are not used with sqlite3:
        'USER': '',
        'PASSWORD': '',
        'HOST': '', # Empty for localhost through domain sockets or '127.0.0.1' for localhost through TCP.
        'PORT': '', # Set to empty string for default.
    }
}
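A hard-coded path like the one above breaks if the site is moved. One alternative, sketched here with a hypothetical helper named sqlite_database_settings, is to build the path from a directory at run time. This is an illustration, not how the original implementation was configured.

```python
import os


def sqlite_database_settings(directory, filename):
    """Build a DATABASES dict for a sqlite3 file stored in `directory`."""
    return {
        "default": {
            "ENGINE": "django.db.backends.sqlite3",
            # sqlite3 only needs NAME: the full path to the database file.
            "NAME": os.path.join(os.path.abspath(directory), filename),
        }
    }


# In settings.py this could be used as:
# DATABASES = sqlite_database_settings( os.path.dirname( __file__ ), "reddit.sqlite" )
```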
in settings.py, add 'south' to the INSTALLED_APPS list. Example:
INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    # Uncomment the next line to enable the admin:
    # 'django.contrib.admin',
    # Uncomment the next line to enable admin documentation:
    # 'django.contrib.admindocs',
    'south',
)
Once the database is configured in settings.py, run "python manage.py syncdb" in your site directory to create database tables.
in settings.py, add 'django_reference_data' to the INSTALLED_APPS list. Example:
INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    # Uncomment the next line to enable the admin:
    # 'django.contrib.admin',
    # Uncomment the next line to enable admin documentation:
    # 'django.contrib.admindocs',
    'south',
    'django_reference_data',
)
python manage.py migrate django_reference_data
The easiest way to run code from a shell is to go to your django site's folder and use manage.py to open a shell:
python manage.py shell
If you choose, you can also just open the base python interpreter:
python
Or you can install something fancier like ipython (pip install ipython, or install it using your OS's package manager), and then run ipython:
ipython
If you don't use manage.py to open a shell (or if you are making a shell script that will be run on its own), you'll have to do a little additional setup to pull in and configure django:
# make sure the site directory is in the sys path.
import sys
site_path = '<site_folder_full_path>'
if site_path not in sys.path:
    sys.path.append( site_path )
#-- END check to see if site path is in sys.path. --#

# if not running in django shell (python manage.py shell), make sure django
# classes have access to settings.py:
# set DJANGO_SETTINGS_MODULE environment variable = "<site_folder_name>.settings".
import os
os.environ[ 'DJANGO_SETTINGS_MODULE' ] = "<site_folder_name>.settings"
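If several scripts need this setup, the two steps can be wrapped in one function. The version below is a sketch with a hypothetical name, configure_django. Unlike the snippet above, it uses os.environ.setdefault, so a DJANGO_SETTINGS_MODULE already exported in the environment wins; that is a design choice made for this example, not something the project requires.

```python
import os
import sys


def configure_django(site_path, settings_module):
    """Make a Django site importable and point Django at its settings."""
    # Add the site directory to the import path only once.
    if site_path not in sys.path:
        sys.path.append(site_path)
    # setdefault leaves an already-exported DJANGO_SETTINGS_MODULE alone.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)
    return os.environ["DJANGO_SETTINGS_MODULE"]
```

Call it before importing any models, for example configure_django( '<site_folder_full_path>', '<site_folder_name>.settings' ). On Django 1.7 and later a standalone script also has to call django.setup() afterwards; the South-based setup described here predates that.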
Then, there are numerous examples in /examples you can use to try out different ways of creating domain reference data (domains are the only type implemented at the moment):
- - pull in domains from alexa's lists of most popular web sites.
- - pull in domain names of CBS news' local affiliate stations' web sites.
- - pull in domain names of ABC local affiliate stations' web sites from ABC's web site.
- - pull in domain names of NBC local affiliate stations' web sites from a database table made by copying and pasting contents of NBC's affiliate finder by hand (yeah...).
- - pull in domain names of Fox local affiliate stations' web sites from Fox's web site.
- - pull in domain names of NPR stations' web sites from NPR Station Finder API, using city/state combinations from postal code database.
- ref_domain-from_listofnewspapers - pull in domains of newspaper websites in the United States from
These are also good examples of how to use Beautiful Soup 4, if you are interested!
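All of the examples end up with domain names, so at some point a web site URL has to be reduced to a domain. The sketch below shows one generic way to do that with only the standard library. It is not taken from the example scripts, does not use Beautiful Soup, and the function names domain_from_url and unique_domains are made up for illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def domain_from_url(url):
    """Return the lower-cased host name of a URL, without a leading "www."."""
    # urlparse only finds the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def unique_domains(urls):
    """Return the distinct domains found in `urls`, in first-seen order."""
    seen = []
    for url in urls:
        domain = domain_from_url(url)
        if domain and domain not in seen:
            seen.append(domain)
    return seen
```

Stripping "www." is a simplification; whether www.example.com and example.com should count as the same domain depends on how you plan to match against the reference data.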
There is also a fixture of news domains you can import into the reference_domain model in /fixtures/reference_domains_news.json. To load it:
python manage.py loaddata reference_domains_news.json
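Before loading a fixture, it can be useful to see what is in it. A Django JSON fixture is a list of objects, each with "model", "pk", and "fields" keys, so a few lines of Python can count the records per model. This is a minimal sketch; the function name count_fixture_models is hypothetical.

```python
import json
from collections import Counter


def count_fixture_models(fixture_text):
    """Count the objects in a Django JSON fixture, grouped by model label."""
    records = json.loads(fixture_text)
    return Counter(record["model"] for record in records)
```

To use it, read the fixture file into a string and pass it in, for example count_fixture_models( open( 'reference_domains_news.json' ).read() ).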
The Postal_Code model is designed based on free data from geonames, specifically the country-by-country tab-delimited files of postal codes. The fields are moved around a bit, but all fields in those files are in this database table, and if you wanted to, you could import them all.
To get these files, go to ( for the United States: ).
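If you want to import one of these files yourself, the tab-delimited format is easy to read with the csv module. The sketch below assumes the column order given in the geonames postal code readme, so check it against the readme in the archive you download. The column names, the function name read_postal_codes, and the sample row in the usage note are made up for illustration, and they do not map onto the Postal_Code model's field names.

```python
import csv

# Column order assumed from the geonames postal code readme.
GEONAMES_COLUMNS = [
    "country_code", "postal_code", "place_name",
    "admin_name1", "admin_code1",
    "admin_name2", "admin_code2",
    "admin_name3", "admin_code3",
    "latitude", "longitude", "accuracy",
]


def read_postal_codes(lines):
    """Yield one dict per row of a tab-delimited postal code file."""
    # QUOTE_NONE keeps quote characters in place names as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield dict(zip(GEONAMES_COLUMNS, row))
```

Any iterable of lines works, including an open file. For a made-up row such as "XX\t00000\tExample Town\tExample State\tES\tExample County\t001\t\t\t0.0\t0.0\t1", the resulting dict has postal_code "00000" and admin_code1 "ES". All values come back as strings, so latitude and longitude need converting before they go into numeric fields.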
There is a fixture for this model that includes the postal codes for the United States, from a geonames file downloaded most recently on July 3, 2013. It is /fixtures/postal_codes_US.json. To load it:
python manage.py loaddata postal_codes_US.json
The original tab-delimited file zip archive is also in the repository: /examples/US-ZIP-2012.07.02 (includes readme, tab-delimited postal code file, and that same file converted to Excel).
Copyright 2012, 2013 Jonathan Morgan
This file is part of
django_reference_data is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
django_reference_data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with If not, see