## GreenDS
# Data Management and Storage
## Access to MySQL from Jupyter Notebook

## Introduction
This Jupyter Notebook is part of exercise *dms_ex_13_mysql_python*. Its purpose is to demonstrate how to connect to a MySQL database and retrieve data using SQL queries. We will import the data into a pandas DataFrame and do some analysis afterwards.

Let's begin.

## 1. Install necessary modules

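The following modules are necessary: *ipython-sql*, *mysql-python* and *mysqlclient*. ***ipython-sql*** lets us write IPython magic commands in SQL. This makes it easier to run SQL queries, especially longer ones that span more than one line. Note that *mysql-python* supports Python 2 only; on Python 3 its installation is expected to fail, and *mysqlclient* alone provides the `MySQLdb` module.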

In [None]:
!pip install ipython-sql
!pip install mysql-python
!pip install mysqlclient

## 2. Import the necessary modules

In [None]:
import pandas as pd
import MySQLdb

To load the IPython magic SQL module, we run the following:

In [None]:
%load_ext sql

## 3. Make the connection to the database
For this, you need to provide the following parameters:
- username
- password
- server name - if the server is the local machine, then the value is *localhost*
- name of the database

The connection syntax is the following:
**mysql://`<username>`:`<password>`@`<server>`/`<database>`**

Replace these placeholders with your own values in the following code:

In [None]:
%sql mysql://dms_user:22dms2022@localhost/dms_INE
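
If the password contains characters such as `@`, `/` or `:`, the connection string can become ambiguous. The following is a minimal sketch of one way to assemble it, assuming a hypothetical helper `build_mysql_url` (not part of ipython-sql) and hypothetical environment variable names `DMS_USER` and `DMS_PASSWORD`. It percent-encodes the credentials and keeps them out of the notebook:

```python
import os
from urllib.parse import quote


def build_mysql_url(username, password, server, database):
    """Assemble mysql://<username>:<password>@<server>/<database>.

    Hypothetical helper: the credentials are percent-encoded so that
    characters like '@' or '/' in a password do not break the URL.
    """
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"mysql://{user}:{pwd}@{server}/{database}"


# DMS_USER and DMS_PASSWORD are assumed names; the fallbacks are placeholders.
connection_url = build_mysql_url(
    os.environ.get("DMS_USER", "your_username"),
    os.environ.get("DMS_PASSWORD", "your_password"),
    "localhost",
    "dms_INE",
)
print(connection_url)
```

How the string is then handed to `%sql` may depend on the ipython-sql version; IPython's `$connection_url` expansion in line magics is one option to try.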

## 4. Make queries
We can write a simple query as follows, storing the result set of the query:

In [None]:
result = %sql SELECT * FROM region LIMIT 10;

And afterwards, import the result set to a Pandas DataFrame:

In [None]:
pdf = result.DataFrame()
pdf

## Bigger queries
If the query is too long, it is easier to use the _cell magic_. We do this by adding `%%sql` at the beginning of the cell, which indicates that the whole cell is SQL. We can then write SQL statements as if we were in DBeaver, the MySQL command line or another MySQL client. The cell can also contain multiple SQL statements, but only the result of the last one is returned as output. In the cell below, this output is assigned to the local variable `result2` with the `<<` operator.

**Repeat the last query of ex 12:**

_Get the sum of family members per level of education
 for 2019, at the freguesia level, for freguesias that belong to the NUTS3 region
 'Algarve'.
 Output the NUTS3 name, municipality, freguesia, year, education level and
 sum of family members with that level of education.
 Remove the education level with the value 'Total'._

In [None]:
%%sql

result2 << SELECT
    r3.region_name AS nuts3_name,
    r2.region_name AS municipality,
    r.region_name AS freguesia,
    e.`year`,
    el.education_level,
    SUM(e.value) AS sum_education
FROM
    education e
INNER JOIN education_level el ON
    e.education_level_ID = el.education_level_ID
INNER JOIN region r ON
    e.NutsID = r.NutsID
INNER JOIN region r2 ON
    r.ParentCodeID = r2.NutsID
INNER JOIN region r3 ON
    r2.ParentCodeID = r3.NutsID
WHERE
    el.education_level <> 'Total'
    AND r.level_ID = 5
    AND r3.region_name = 'Algarve'
    AND e.`year` = 2019
GROUP BY
    r3.region_name, r2.region_name, r.region_name, e.`year`, el.education_level;

In [None]:
pdf1 = result2.DataFrame()
pdf1
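
The three joins on `region` climb the hierarchy from freguesia to municipality to NUTS3 through `ParentCodeID`. Below is a minimal, self-contained sketch of that pattern, using an in-memory SQLite database as a stand-in for MySQL. The table is reduced to four columns, and the rows, codes and the level numbers 3 and 4 are made up for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE region (
        NutsID TEXT PRIMARY KEY,
        ParentCodeID TEXT,
        region_name TEXT,
        level_ID INTEGER
    );
    -- Toy rows: codes, names and level numbers 3 and 4 are illustrative only.
    INSERT INTO region VALUES
        ('N3', NULL, 'Algarve', 3),
        ('M1', 'N3', 'Municipality A', 4),
        ('F1', 'M1', 'Freguesia A1', 5),
        ('F2', 'M1', 'Freguesia A2', 5);
    """
)

rows = conn.execute(
    """
    SELECT r3.region_name AS nuts3_name,
           r2.region_name AS municipality,
           r.region_name  AS freguesia
    FROM region r
    INNER JOIN region r2 ON r.ParentCodeID = r2.NutsID   -- freguesia -> municipality
    INNER JOIN region r3 ON r2.ParentCodeID = r3.NutsID  -- municipality -> NUTS3
    WHERE r.level_ID = 5 AND r3.region_name = 'Algarve'
    ORDER BY r.region_name
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Each alias (`r`, `r2`, `r3`) is the same table read at a different level, so every freguesia row picks up the names of its parent and grandparent.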

List the tables in the database.

In [None]:
%sql show tables;

Q.1. _Obtain the total number of annual working units (AWU) for municipalities with a vineyard area greater than 10 ha, for the year 2019. List the municipality name, year, area._

In [None]:
%%sql 

/* write your code here */


## 5. Create a graph of the Agricultural Census 2019 dashboard
At the beginning of the course we set the goal of creating a database from the data made available by INE for the Agricultural Census 2019, one that would allow us to reproduce the charts included in the [AC dashboard](https://www.ine.pt/scripts/db_ra_2019.html).

This is not yet possible for all charts, because some of the tables still need to be downloaded, preprocessed and imported into the database (such as the data on the [Utilised Agriculture Area](https://www.ine.pt/xportal/xmain?xpid=INE&xpgid=ine_indicadores&indOcorrCod=0010518&contexto=bd&selTab=tab2&xlang=en)). However, we can create charts for the **Permanent Crops** and **Temporary Crops**.

We will start with the permanent crops. We need a query that returns the number of holdings with permanent crops per type of crop at the NUTS2 level for 2019.

In [None]:
%%sql 

perm_crop_result << SELECT
	pcn.crop_name ,
	SUM(pc.`hold`) AS sum_holdings
FROM
	permanent_crop pc
INNER JOIN permanent_crop_name pcn
ON
	pc.pc_name_ID = pcn.pc_name_ID
INNER JOIN region r ON
	pc.NutsID = r.NutsID
WHERE
	pc.`year` = 2019
	AND r.level_ID = 2
	AND pcn.crop_name <> 'Total'
GROUP BY
	pcn.pc_name_ID
ORDER BY
	sum_holdings DESC ;

Before we plot, let's import the result to a Pandas dataframe.

In [None]:
perm_crop_df = perm_crop_result.DataFrame()

It is useful to check the structure of the dataframe created.

In [None]:
perm_crop_df.info()

We can see that the number of holdings has dtype `object`. To be plotted, it must be of an integer type. We can convert it as follows:

In [None]:
perm_crop_df['sum_holdings'] = perm_crop_df['sum_holdings'].astype(str).astype(int)
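
The `object` dtype most likely arises because MySQL returns `SUM()` over exact numeric columns as a `DECIMAL`, which the driver hands to Python as `decimal.Decimal`. Here is a small sketch in plain Python (toy values, not query output) of what the string-then-integer conversion does to one such value, and where it can fail:

```python
from decimal import Decimal

whole = Decimal("120")          # toy value standing in for one SUM() result
with_scale = Decimal("120.00")  # the same number carrying decimal places

# str -> int works when the text has no decimal point.
print(int(str(whole)))

# With decimal places the text is "120.00", which int() rejects.
try:
    int(str(with_scale))
    string_route_ok = True
except ValueError:
    string_route_ok = False
print(string_route_ok)

# Converting the Decimal directly truncates toward zero and succeeds.
print(int(with_scale))
```

If the cast in the notebook raises a `ValueError`, going through `float` first (`.astype(float).astype(int)`) is one possible workaround, assuming the sums are whole numbers.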

And we can set the crop name to be the index of the dataframe:

In [None]:
perm_crop_df = perm_crop_df.set_index('crop_name')

Finally, we can make the bar plot:

In [None]:
perm_crop_df.plot(kind='bar')

## 6. Do the chart for temporary crops
Repeat the query and chart creation, but for temporary crops.

In [None]:
## write your code here