## Connecting to a MySQL Database
Before you jump into the calculation exercises, let's begin by connecting to our database. Recall that in the last chapter you connected to a PostgreSQL database. Now, you'll connect to a MySQL database, for which many prefer to use the `pymysql` database driver, which, like `psycopg2` for PostgreSQL, you have to install prior to use.

This connection string is going to start with `'mysql+pymysql://'`, indicating which dialect and driver you're using to establish the connection. The dialect block is followed by the `'username:password'` combo. Next, you specify the host and port with the following `'@host:port/'`. Finally, you wrap up the connection string with the `'database_name'`.
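The pieces described above can be assembled programmatically. The following is a minimal sketch, not part of the course material: `build_mysql_url` is a hypothetical helper, and the credentials and host are made-up values. It percent-encodes the username and password with the standard library's `quote_plus`, which matters when a password contains characters such as `@` or `/` that would otherwise be read as URL delimiters.

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host, port, database):
    """Assemble a 'mysql+pymysql://' connection string from its parts.

    Hypothetical helper for illustration; not part of SQLAlchemy.
    """
    # Percent-encode the credentials so that characters such as '@' or '/'
    # cannot be mistaken for the delimiters of the URL itself.
    user = quote_plus(username)
    pwd = quote_plus(password)
    # dialect+driver :// username:password @ host:port / database_name
    return f"mysql+pymysql://{user}:{pwd}@{host}:{port}/{database}"


# Made-up values, for illustration only
url = build_mysql_url("student", "p@ss/word", "localhost", 3306, "census")
print(url)
```

The resulting string can be passed to `create_engine()` in the same way as the hard-coded string used in this chapter.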

Now you'll practice connecting to a MySQL database: it will be the same `census` database that you have already been working with. One of the great things about SQLAlchemy is that, after connecting, **it abstracts over the type of database it has connected to, so you can write the same SQLAlchemy code regardless!**

___

> To use MySQL with this driver, we need to install the [PyMySQL][1] package; otherwise we get this error: *ModuleNotFoundError: No module named 'pymysql'*.
>
> `pip install pymysql`

[1]: https://pypi.org/project/PyMySQL/


In [1]:
strConn_MySQL='mysql+pymysql://student:datacamp@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:3306/census'

In [4]:
# Import create_engine function
from sqlalchemy import create_engine

# Create an engine to the census database
engine = create_engine(strConn_MySQL)

# create connection
connection = engine.connect()

# Print the table names
print(engine.table_names())

['census', 'state_fact']


## Calculating a Difference between Two Columns
Often, you'll need to perform math operations as part of a query, for example to calculate the change in population from 2000 to 2008. For math operations on numbers, the operators in SQLAlchemy work the same way as they do in Python.

You can use these operators to perform addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), and modulus (`%`) operations. Note: They behave differently when used with non-numeric column types.

Let's now find the top 5 states by population growth between 2000 and 2008.

In [5]:
#import
from sqlalchemy import MetaData, Table, select, desc, func
metadata = MetaData()

# Reflect census and state_fact table via engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)
state_fact = Table('state_fact', metadata, autoload=True, autoload_with=engine)

# Build query to return state names by population difference from 2008 to 2000: stmt
stmt = select([census.columns.state, (census.columns.pop2008-census.columns.pop2000).label('pop_change')])

# Append group by for the state: stmt
stmt = stmt.group_by(census.columns.state)

# Append order by for pop_change descendingly: stmt
stmt = stmt.order_by(desc('pop_change'))

# Return only 5 results: stmt
stmt = stmt.limit(5)

# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()

# Print the state and population change for each record
for result in results:
    print('{}:{}'.format(result.state, result.pop_change))

Texas:40137
California:35406
Florida:21954
Arizona:14377
Georgia:13357
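The `census` table holds several rows per state (one per sex and age), so a per-state growth figure needs the row-level difference to be aggregated. The sketch below shows that variant in plain SQL, using Python's built-in `sqlite3` module and a tiny made-up table so it runs without a database server. The numbers are invented. In SQLAlchemy, the assumed equivalent would wrap the difference in `func.sum(...)` before calling `.label('pop_change')`.

```python
import sqlite3

# In-memory stand-in for the census table; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE census "
    "(state TEXT, sex TEXT, age INTEGER, pop2000 INTEGER, pop2008 INTEGER)"
)
rows = [
    ("Texas", "F", 0, 100, 130),
    ("Texas", "M", 0, 110, 150),
    ("Ohio", "F", 0, 90, 95),
    ("Ohio", "M", 0, 95, 97),
    ("Utah", "F", 0, 40, 60),
    ("Utah", "M", 0, 42, 58),
]
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?, ?)", rows)

# Subtract the columns row by row, then sum the differences per state.
query = """
    SELECT state, SUM(pop2008 - pop2000) AS pop_change
    FROM census
    GROUP BY state
    ORDER BY pop_change DESC
    LIMIT 2
"""
top_growth = conn.execute(query).fetchall()

for state, pop_change in top_growth:
    print("{}:{}".format(state, pop_change))

conn.close()
```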


## Determining the Overall Percentage of Females
It's possible to combine functions and operators in a single select statement as well. These combinations can be exceptionally handy when we want to calculate percentages or averages, and we can also use the `case()` expression to operate on data that meets specific criteria while not affecting the query as a whole. The `case()` expression accepts a list of conditions to match and the column to return if the condition matches, followed by an `else_` if none of the conditions match. We can wrap this entire expression in any function or math operation we like.

Often when performing integer division, we want to get a float back. While some databases will do this automatically, you can use the `cast()` function to convert an expression to a particular type.

In [6]:
# import case, cast and Float from sqlalchemy
from sqlalchemy import case, cast, Float

# Build an expression to calculate female population in 2000
female_pop2000 = func.sum(
    case([
        (census.columns.sex == 'F', census.columns.pop2000)
    ], else_=0))

# Cast an expression to calculate total population in 2000 to Float
total_pop2000 = cast(func.sum(census.columns.pop2000), Float)

# Build a query to calculate the percentage of females in 2000: stmt
stmt = select([female_pop2000 / total_pop2000 * 100])

print(stmt)

SELECT (sum(CASE WHEN (census.sex = :sex_1) THEN census.pop2000 ELSE :param_1 END) / CAST(sum(census.pop2000) AS FLOAT)) * :param_2 AS anon_1 
FROM census


In [7]:
# Execute the query and store the scalar result: percent_female
percent_female = connection.execute(stmt).scalar()

# Print the percentage
print(percent_female)

50.7455
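To see why the cast matters, the sketch below runs the same `CASE` and `CAST` pattern against SQLite, which truncates integer division. It uses the standard library's `sqlite3` module and a small made-up table rather than the course database, so the figures are illustrative only.

```python
import sqlite3

# In-memory stand-in for the census table; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (sex TEXT, pop2000 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("F", 60), ("M", 40), ("F", 90), ("M", 110)],
)

# Sum pop2000 only for rows where sex is 'F'; other rows contribute 0.
female_sum = "SUM(CASE WHEN sex = 'F' THEN pop2000 ELSE 0 END)"

# Integer / integer: SQLite truncates 150 / 300 to 0 before multiplying.
int_pct = conn.execute(
    f"SELECT {female_sum} / SUM(pop2000) * 100 FROM census"
).fetchone()[0]

# Casting the denominator to a floating-point type keeps the fraction.
float_pct = conn.execute(
    f"SELECT {female_sum} / CAST(SUM(pop2000) AS REAL) * 100 FROM census"
).fetchone()[0]

print(int_pct, float_pct)
conn.close()
```

Databases differ here, which is why the explicit cast is the portable choice.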



## Automatic Joins with an Established Relationship
If you have two tables that already have an established relationship, you can automatically use that relationship by just adding the columns you want from each table to the select statement. Recall that Jason constructed the following query:

`stmt = select([census.columns.pop2008, state_fact.columns.abbreviation])`

in order to join the `census` and `state_fact` tables and select the `pop2008` column from the first and the `abbreviation` column from the second. In this case, the `census` and `state_fact` tables had a pre-defined relationship: the `state` column of the former corresponded to the `name` column of the latter.

In this exercise, you'll use the same predefined relationship to select the `pop2000` and `abbreviation` columns!

In [8]:
# Build a statement to join census and state_fact tables: stmt
stmt = select([census.columns.pop2000, state_fact.columns.abbreviation])

# Execute the statement and get the first result: result
result = connection.execute(stmt).first()

# Loop over the keys in the result object and print the key and value
for key in result.keys():
    print(key, getattr(result, key))

pop2000 89600
abbreviation IL
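It is worth checking which rows a two-table select actually returns, because the first row alone looks the same whether or not a join condition was applied. In plain SQL, listing two tables without a condition pairs every row of one with every row of the other. The sketch below demonstrates this with the standard library's `sqlite3` module and two tiny made-up tables. It makes no claim about what SQLAlchemy renders for your tables; you can inspect that with `print(stmt)`.

```python
import sqlite3

# In-memory stand-ins for the two tables; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, pop2000 INTEGER)")
conn.execute("CREATE TABLE state_fact (name TEXT, abbreviation TEXT)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("Illinois", 89600), ("Texas", 120000)],
)
conn.executemany(
    "INSERT INTO state_fact VALUES (?, ?)",
    [("Illinois", "IL"), ("Texas", "TX")],
)

# No condition: every census row is paired with every state_fact row.
unfiltered = conn.execute(
    "SELECT pop2000, abbreviation FROM census, state_fact"
).fetchall()

# With the state/name condition: only matching pairs remain.
matched = conn.execute(
    "SELECT pop2000, abbreviation FROM census, state_fact "
    "WHERE census.state = state_fact.name"
).fetchall()

print(len(unfiltered), len(matched))
conn.close()
```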


## Joins
If you aren't selecting columns from both tables or the two tables don't have a defined relationship, you can still use the `.join()` method on a table to join it with another table and get extra data related to your query. The `.join()` method takes the table object you want to join in as the first argument and a condition that indicates how the tables are related as the second argument. Finally, you use the `.select_from()` method on the select statement to wrap the join clause. For example, in the video, Jason executed the following code to join the `census` table to the `state_fact` table such that the `state` column of the `census` table corresponded to the `name` column of the `state_fact` table.

```python
stmt = stmt.select_from(
    census.join(
        state_fact,
        census.columns.state == state_fact.columns.name))
```

In [9]:
# Build a statement to select the census and state_fact tables: stmt
stmt = select([census, state_fact])

# Add a select_from clause that wraps a join for the census and state_fact
# tables where the census state column and state_fact name column match
stmt = stmt.select_from(
    census.join(state_fact, census.columns.state == state_fact.columns.name))

# Execute the statement and get the first result: result
result = connection.execute(stmt).first()

# Loop over the keys in the result object and print the key and value
for key in result.keys():
    print(key, getattr(result, key))

state Illinois
sex M
age 0
pop2000 89600
pop2008 95012
id 13
name Illinois
abbreviation IL
country USA
type state
sort 10
status current
occupied occupied
notes 
fips_state 17
assoc_press Ill.
standard_federal_region V
census_region 2
census_region_name Midwest
census_division 3
census_division_name East North Central
circuit_court 7
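For comparison, here is the same kind of explicit join written as plain SQL, a sketch using the standard library's `sqlite3` module with made-up, cut-down versions of the two tables. Setting `row_factory` to `sqlite3.Row` is an assumption made here so that each row exposes `.keys()`, which mirrors the key and value loop above.

```python
import sqlite3

# In-memory stand-ins for the two tables; all values are made up.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows support .keys() and lookup by name
conn.execute("CREATE TABLE census (state TEXT, pop2000 INTEGER)")
conn.execute("CREATE TABLE state_fact (name TEXT, abbreviation TEXT)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("Illinois", 89600), ("Texas", 120000)],
)
conn.executemany(
    "INSERT INTO state_fact VALUES (?, ?)",
    [("Illinois", "IL"), ("Texas", "TX")],
)

# Join on the condition that census.state matches state_fact.name.
first = conn.execute(
    "SELECT * FROM census "
    "JOIN state_fact ON census.state = state_fact.name "
    "ORDER BY census.state"
).fetchone()

# Loop over the column names and print each name with its value.
for key in first.keys():
    print(key, first[key])

conn.close()
```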


## More Practice with Joins
You can use the same select statement you built in the last exercise. However, let's add a twist: return only a few columns and use the other table in a `group_by()` clause.

In [10]:
# import pandas
import pandas as pd

# Build a statement to select the state, sum of 2008 population and census division name: stmt
stmt = select([
    census.columns.state,
    func.sum(census.columns.pop2008),
    state_fact.columns.census_division_name
])

# Append select_from to join the census and state_fact tables by the census state and state_fact name columns
stmt = stmt.select_from(
    census.join(state_fact, census.columns.state == state_fact.columns.name)
)

# Append a group by for the state_fact name column
stmt = stmt.group_by(state_fact.columns.name)

# Execute the statement and get the results: results
results = connection.execute(stmt).fetchall()

# Create a DataFrame from the results: df
df = pd.DataFrame(results)

# Set column names
df.columns = results[0].keys()

# Print the Dataframe
df.head()

| | state | sum_1 | census_division_name |
|---|---|---|---|
| 0 | Alabama | 4681422 | East South Central |
| 1 | Alaska | 664546 | Pacific |
| 2 | Arizona | 10698743 | Mountain |
| 3 | Arkansas | 4343608 | West South Central |
| 4 | California | 56952946 | Pacific |
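The `sum_1` column name is generated automatically because the aggregate has no label. The sketch below shows the same grouped join in plain SQL with an explicit alias (`pop2008_total`, a name chosen here for illustration) and reads the column names back from the cursor. It uses the standard library's `sqlite3` module and made-up data instead of pandas and the course database.

```python
import sqlite3

# In-memory stand-ins for the two tables; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, sex TEXT, pop2008 INTEGER)")
conn.execute("CREATE TABLE state_fact (name TEXT, census_division_name TEXT)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [("Alabama", "F", 10), ("Alabama", "M", 12),
     ("Alaska", "F", 3), ("Alaska", "M", 4)],
)
conn.executemany(
    "INSERT INTO state_fact VALUES (?, ?)",
    [("Alabama", "East South Central"), ("Alaska", "Pacific")],
)

cursor = conn.execute(
    "SELECT census.state, "
    "       SUM(census.pop2008) AS pop2008_total, "
    "       state_fact.census_division_name "
    "FROM census JOIN state_fact ON census.state = state_fact.name "
    "GROUP BY state_fact.name "
    "ORDER BY census.state"
)

# The first item of each cursor.description entry is the column name.
column_names = [column[0] for column in cursor.description]
records = [dict(zip(column_names, row)) for row in cursor.fetchall()]

for record in records:
    print(record)

conn.close()
```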


## Working on Blocks of Records
Fantastic work so far! As Jason discussed in the video, sometimes you may need to work on a large ResultProxy without having enough memory to load all the results at once. To work around that issue, you can get blocks of rows from the ResultProxy by calling the `.fetchmany()` method inside a loop, passing it the number of records you want per block. When it returns an empty list, there are no more rows left to fetch, and you have processed all the results of the query. Then you need to use the `.close()` method to close out the connection to the database.

You'll now have the chance to practice this on a large ResultProxy called `results_proxy` that holds all the data in the `census` table.

In [16]:
# Build a statement to select all columns from the census table: stmt
stmt = select([census])

# Execute the statement 
results_proxy = connection.execute(stmt)

# Condition
more_results = True

# Initialize an empty dict to hold the row count per state
state_count = {}

# Start a while loop checking for more results
while more_results:
    # Fetch the next 50 results from the ResultProxy: partial_results
    partial_results = results_proxy.fetchmany(50)

    # if empty list, set more_results to False
    if partial_results == []:
        more_results = False

    # Loop over the fetched records and increment the count for the state
    for row in partial_results:
        if row.state in state_count:
            state_count[row.state] += 1
        else:
            state_count[row.state] = 1

# Close the ResultProxy, and thus the connection
results_proxy.close()

# Print the count by state
state_count

{'Illinois': 210,
 'New Jersey': 172,
 'District of Columbia': 188,
 'North Dakota': 172,
 'Florida': 196,
 'Maryland': 217,
 'Idaho': 172,
 'Massachusetts': 172,
 'Oregon': 218,
 'Nevada': 266,
 'Michigan': 239,
 'Wisconsin': 232,
 'Missouri': 242,
 'Washington': 206,
 'North Carolina': 172,
 'Arizona': 287,
 'Arkansas': 260,
 'Colorado': 247,
 'Indiana': 202,
 'Pennsylvania': 230,
 'Hawaii': 172,
 'Kansas': 255,
 'Louisiana': 197,
 'Alabama': 173,
 'Minnesota': 229,
 'South Dakota': 172,
 'New York': 232,
 'California': 262,
 'Connecticut': 182,
 'Ohio': 249,
 'Rhode Island': 180,
 'Georgia': 172,
 'South Carolina': 172,
 'Alaska': 172,
 'Delaware': 172,
 'Tennessee': 230,
 'Vermont': 222,
 'Montana': 210,
 'Kentucky': 190,
 'Utah': 172,
 'Nebraska': 221,
 'West Virginia': 172,
 'Iowa': 172,
 'Wyoming': 208,
 'Maine': 256,
 'New Hampshire': 260,
 'Mississippi': 265,
 'Oklahoma': 228,
 'New Mexico': 268,
 'Virginia': 214,
 'Texas': 270}
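The same batching pattern works with any DB-API cursor. As a minimal sketch, the example below uses the standard library's `sqlite3` module and a small made-up table in place of the course database, with `collections.Counter` taking over the manual dictionary bookkeeping. The batch size of 5 is arbitrary.

```python
import sqlite3
from collections import Counter

# In-memory stand-in for the census table; all values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, pop2008 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("Texas", i) for i in range(7)] + [("Ohio", i) for i in range(5)],
)

cursor = conn.execute("SELECT state FROM census")
state_count = Counter()
batches = 0

while True:
    # Fetch the next block of at most 5 rows
    partial_results = cursor.fetchmany(5)

    # An empty list means every row has been processed
    if not partial_results:
        break

    batches += 1
    # Each row is a 1-tuple; count one occurrence per state name
    state_count.update(state for (state,) in partial_results)

# Release the cursor and the connection once all rows are consumed
cursor.close()
conn.close()

print(batches, dict(state_count))
```

Breaking out of the loop as soon as the empty list arrives avoids the extra pass over an empty block and removes the need for a separate flag variable.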