# Census Case Study

Bring together all of the skills you have acquired to work on a real-life project, from connecting to a database and populating it to reading and querying it.

In [1]:
path='data/dc25/'

## Set up the engine and metadata

In this exercise, your job is to create an engine for the database used in this case study and then initialize its metadata.

In [9]:
# Import create_engine, MetaData
from sqlalchemy import create_engine, MetaData

# Define an engine to connect to CensusCaseStudy.sqlite: engine
engine = create_engine('sqlite:///'+path+'CensusCaseStudy.sqlite')

# Create a connection on engine
connection = engine.connect()

# Initialize MetaData: metadata
metadata = MetaData()

## Create the table in the database

Having set up the engine and initialized the metadata, you will now define the census table object and then create it in the database. To create it, call the `.create_all()` method on the metadata with the engine as the argument.

In [3]:
# Import Table, Column, String, and Integer
from sqlalchemy import Table, Column, String, Integer

# Build a census table: census
census = Table('census', metadata,
               Column('state', String(30)),
               Column('sex', String(1)),
               Column('age', Integer()),
               Column('pop2000', Integer()),
               Column('pop2008', Integer()))

# Create the table in the database
metadata.create_all(engine)

When creating columns of type `String()`, it's important to spend some time thinking about what their maximum lengths should be.
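One way to choose a length is to measure the longest value actually present in the data. The following is a minimal sketch; the `states` list is a hypothetical handful of names, not the contents of the census file.

```python
# Hypothetical sample of values destined for the 'state' column
states = ['Illinois', 'New Jersey', 'District of Columbia']

# Length of the longest value; String(n) should be at least this large
longest = max(len(name) for name in states)
print(longest)
```

SQLite itself does not enforce the declared length of a `String(30)` column, but other backends may reject or truncate longer values, so the length is still worth choosing carefully.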

## Reading the data from the CSV

Use the `csv` module from the Python standard library to load the data into a list of dictionaries.

In [5]:
import csv

# Create an empty list: values_list
values_list = []

# newline='' lets the csv module handle line endings itself
with open(path+'census.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        # Create a dictionary with the values
        data = {'state': row[0], 'sex': row[1], 'age':row[2], 'pop2000': row[3],
                'pop2008': row[4]}
        # Append the dictionary to the values list
        values_list.append(data)

In [7]:
values_list[:5]

[{'state': 'Illinois',
  'sex': 'M',
  'age': '0',
  'pop2000': '89600',
  'pop2008': '95012'},
 {'state': 'Illinois',
  'sex': 'M',
  'age': '1',
  'pop2000': '88445',
  'pop2008': '91829'},
 {'state': 'Illinois',
  'sex': 'M',
  'age': '2',
  'pop2000': '88729',
  'pop2008': '89547'},
 {'state': 'Illinois',
  'sex': 'M',
  'age': '3',
  'pop2000': '88868',
  'pop2008': '90037'},
 {'state': 'Illinois',
  'sex': 'M',
  'age': '4',
  'pop2000': '91947',
  'pop2008': '91111'}]
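Every value in this output is a string, because `csv.reader` performs no type conversion. If integers are preferred before inserting, the conversion could be sketched as follows. This is an illustration only: it reads from an in-memory `io.StringIO` holding the first two rows shown above rather than from `census.csv`, and the names `fieldnames`, `int_fields`, and `typed_values` are assumptions introduced for the example.

```python
import csv
import io

# First two rows from the output above, in the same column order as census.csv
sample = io.StringIO("Illinois,M,0,89600,95012\nIllinois,M,1,88445,91829\n")

# Column names supplied by hand, since the file has no header row
fieldnames = ['state', 'sex', 'age', 'pop2000', 'pop2008']
int_fields = ('age', 'pop2000', 'pop2008')

typed_values = []
for row in csv.DictReader(sample, fieldnames=fieldnames):
    # Convert the numeric columns from str to int
    for key in int_fields:
        row[key] = int(row[key])
    typed_values.append(row)

print(typed_values[0])
```

`csv.DictReader` builds each dictionary directly, which avoids indexing `row[0]` through `row[4]` by hand.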

## Load data from a list into the Table

Using the multiple insert pattern, you will now load the data from `values_list` into the table.

In [10]:
# Import insert
from sqlalchemy import insert

# Build insert statement: stmt
stmt = insert(census)

# Use values_list to insert data: results
results = connection.execute(stmt, values_list)

# Print rowcount
print(results.rowcount)

8772
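Passing a list of dictionaries to `connection.execute()` corresponds to an "executemany" call at the database driver level. The sketch below shows the same idea with the standard library's `sqlite3` module instead of SQLAlchemy. It assumes a hypothetical in-memory database and a hand-written `CREATE TABLE` statement that only approximates the table defined earlier.

```python
import sqlite3

# Hypothetical in-memory database standing in for CensusCaseStudy.sqlite
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE census (state TEXT, sex TEXT, age INTEGER, '
             'pop2000 INTEGER, pop2008 INTEGER)')

# Two rows copied from the values_list output shown earlier
rows = [
    {'state': 'Illinois', 'sex': 'M', 'age': '0',
     'pop2000': '89600', 'pop2008': '95012'},
    {'state': 'Illinois', 'sex': 'M', 'age': '1',
     'pop2000': '88445', 'pop2008': '91829'},
]

# One statement, many parameter sets: the multiple insert pattern
cursor = conn.executemany(
    'INSERT INTO census VALUES (:state, :sex, :age, :pop2000, :pop2008)',
    rows)
inserted = cursor.rowcount
print(inserted)

# SQLite's INTEGER column affinity stores the numeric strings as integers
stored_age_type = conn.execute('SELECT typeof(age) FROM census').fetchone()[0]
print(stored_age_type)
conn.close()
```

The final query also suggests why inserting the string values read from the CSV works here: SQLite converts well-formed numeric text to integers for `INTEGER` columns. Other databases may be stricter.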


## Determine the average age by population

To calculate a weighted average, we first multiply each value by its weight and sum those products, then divide by the sum of all the weights.

For example, if we wanted to find a weighted average of `data = [10, 30, 50]` weighted by `weights = [2, 4, 6]`, we would compute `(2⋅10 + 4⋅30 + 6⋅50) / (2 + 4 + 6)`, or, with element-wise multiplication, `sum(weights * data) / sum(weights)`.
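In plain Python, lists do not multiply element-wise, so the same calculation can be sketched with `zip`. The variable names `weighted_sum` and `weighted_average` are introduced here for illustration.

```python
data = [10, 30, 50]
weights = [2, 4, 6]

# Sum of each weight times its value: 2*10 + 4*30 + 6*50
weighted_sum = sum(w * x for w, x in zip(weights, data))

# Divide by the total weight: 2 + 4 + 6
weighted_average = weighted_sum / sum(weights)
print(weighted_average)
```

The result is 440 / 12, or roughly 36.67.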

In this exercise, however, you will use `func.sum()` together with `select` to compute the weighted average inside the database. Still working with the census data, you will compute the average age weighted by the population in the year 2000, grouped by sex.

In [11]:
# Import select and func
from sqlalchemy import select, func

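# Select sex and average age weighted by 2000 population
# (select([...]) is the SQLAlchemy 1.x calling style; 2.0 expects the
# columns as positional arguments instead of a list)
stmt = select([(func.sum(census.columns.pop2000 * census.columns.age)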
                / func.sum(census.columns.pop2000)).label('average_age'),
               census.columns.sex])

# Group by sex
stmt = stmt.group_by(census.columns.sex)

# Execute the query and fetch all the results
results = connection.execute(stmt).fetchall()

# Print the sex and average age column for each result
for result in results:
    print(result.sex, result.average_age)

F 37
M 34
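The whole-number results are consistent with SQLite's integer division: when both operands of `/` are integers, SQLite returns an integer. The sketch below illustrates this with the standard library's `sqlite3` module and raw SQL rather than SQLAlchemy. The table contents are hypothetical toy rows reusing the numbers from the weighted-average example, not census data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE census (sex TEXT, age INTEGER, pop2000 INTEGER)')

# Hypothetical toy rows: ages 10, 30, 50 weighted by populations 2, 4, 6
conn.executemany('INSERT INTO census VALUES (?, ?, ?)',
                 [('F', 10, 2), ('F', 30, 4), ('F', 50, 6)])

# First column: integer / integer; second column: forced to floating point
int_avg, float_avg = conn.execute(
    'SELECT SUM(pop2000 * age) / SUM(pop2000), '
    'SUM(pop2000 * age) * 1.0 / SUM(pop2000) '
    'FROM census GROUP BY sex').fetchone()
print(int_avg, float_avg)
conn.close()
```

Multiplying by `1.0` is one simple way to obtain a fractional average. Whether the SQLAlchemy query above needs a similar adjustment depends on the SQLAlchemy version and the backend in use.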

