# Welcome to We Module!

We Module is a set of Python functions developed by the data engineering team.

It is primarily intended to help you with connecting to, querying and transforming the data within the data warehouse.

Before you can do anything, you need to configure We Module locally. See the README for instructions: https://github.com/WeConnect/we_module/blob/master/README.md

Once that is done, you can get started by importing We Module itself.

In [None]:
from we_module.we import We

What is this class you just imported? For most things within We Module, you can find out more by passing them to Python's built-in help function.

In [None]:
help(We)

To get started, you need to initialize your we object. If you pass True for the debug_mode parameter, We Module will print its logs as it runs to give you an idea of what is happening.

In [None]:
we = We(True)

You can see that We Module has established a connection with Redshift. It did this using the Redshift connection string you provided when you ran configure_computer.

So now you have a database connection. Let's run a query! The method you will use for this is get_tbl_query. Let's see what it does:

In [None]:
help(we.get_tbl_query)

It looks like this function takes in a query and returns a dataframe. Let's try it. A popular table in the data warehouse is mv_dim_location, which stores information about all the different WeWork locations.

In [None]:
my_df = we.get_tbl_query("SELECT * FROM dw.mv_dim_location")
my_df.head()

Wow! So simple! But you can already view the contents of the data warehouse with psql or your SQL GUI of choice. What if you want to actually manipulate your data and save it back to the database?

We Module works in tandem with another module called Materialize (also contained within the we_module git repo). Materialize can help you persist your dataframes back to the data warehouse. The primary means of interacting with the Materialize module is by writing a "data script". Data scripts are run by "Data Manager", which is really just a server that periodically pulls the "data_scripts" git repo and schedules a run for every script it finds.

Now it is time for you to write a data script of your own.

In this folder (data_scripts/tutorial), take a look at "my_first_datascript.py". This file consists of two major functions: config and main.

The config function returns an ordinary python dictionary. Data Manager reads the keys from this dictionary in order to know how to run a data script. The dictionary provides information such as the name of the table, how often to run the script, what "type" of script it is, and also who to contact if something goes wrong.
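As a rough illustration, a config function could look something like the sketch below. Only the "schedule" and "incremental_key" names and the four type strings appear in this tutorial; every other key name and every value here is a hypothetical placeholder, so check "my_first_datascript.py" for the keys Data Manager actually expects.

```python
def config():
    """Return the settings Data Manager reads before running the script.

    Key names other than "type" values and "schedule" are hypothetical.
    """
    return {
        # One of "normal", "incremental", "state_capture", or "none".
        "type": "normal",
        # Hypothetical key: the table this script writes to.
        "table_name": "we_module_tutorial_users",
        # "schedule" is indexable in this tutorial; the value format is assumed.
        "schedule": ["daily"],
        # Hypothetical key: who to contact if something goes wrong.
        "contact": "your.name@wework.com",
    }


settings = config()
print(settings["type"], settings["schedule"][0])
```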

The 'main' function is where the real functionality of a data script lies. In most cases, the return value of main will be what is used to update the data warehouse.

There are four "types" of data scripts: "normal", "incremental", "state_capture", and "none".

A "normal" script will replace the existing table with the dataframe that is returned by main.

An "incremental" script will append the data frame returned by main to the existing table based on a comparison to the "incremental_key".

A "state_capture" script will append the entire data frame returned by main to the existing table.

A "none" type script (the string "none", not Python's None) doesn't do anything with the return value. It is typically used for things like email alerts that don't alter the database.
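To make the difference between the four types concrete, here is a small pure-Python simulation that uses lists of dictionaries in place of tables and dataframes. It is not Materialize's code. In particular, the "incremental" rule used here (append only rows whose key value is not in the table yet) is an assumption about what "a comparison to the incremental_key" means, and the function name table_after_run is made up for this example.

```python
SCRIPT_TYPES = ("normal", "incremental", "state_capture", "none")


def table_after_run(existing_rows, returned_rows, script_type, incremental_key=None):
    """Simulate the table contents after one run (illustration only)."""
    if script_type == "normal":
        # Replace the whole table with what main returned.
        return list(returned_rows)
    if script_type == "state_capture":
        # Append everything main returned, duplicates included.
        return list(existing_rows) + list(returned_rows)
    if script_type == "incremental":
        # ASSUMPTION: a row is new when its key value is not in the table yet.
        seen = {row[incremental_key] for row in existing_rows}
        fresh = [row for row in returned_rows if row[incremental_key] not in seen]
        return list(existing_rows) + fresh
    if script_type == "none":
        # The return value is ignored, so the table is unchanged.
        return list(existing_rows)
    raise ValueError(f"unknown script type: {script_type!r}")


existing = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
returned = [{"id": 2, "name": "Grace"}, {"id": 3, "name": "Edsger"}]

for script_type in SCRIPT_TYPES:
    rows = table_after_run(existing, returned, script_type, incremental_key="id")
    print(script_type, [row["id"] for row in rows])
```

With this sample data the loop prints the ids [2, 3] for "normal", [1, 2, 3] for "incremental", [1, 2, 2, 3] for "state_capture", and [1, 2] for "none".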

"my_first_datascript.py" is a "normal" type data script. Let's walk through what it is doing.

You are going to be working with dataframes, so you need to import pandas.

In [None]:
import pandas as pd

The main function of a data script is automatically passed a we object as its first parameter. To simulate this, you can just use the we object you created earlier.

First you should get the list of all users who are already we_module users.

In [None]:
users_df = we.get_tbl_query('''
        SELECT
            name,
            email
        FROM
            dw.we_module_tutorial_users
    ''')
users_df

Now create another dataframe with the same columns that contains your own name and email. Replace the placeholder values below with your own.

In [None]:
my_info = {
        'name': ['Your Name'],
        'email': ['your.name@wework.com']
    }
my_df = pd.DataFrame.from_dict(my_info)
my_df

Next you should append the two dataframes together in order to add your name to the list of we_module users.

The data frame you see below should be the same as the table we_module_tutorial_users once you run your script (with the addition of a _run_at column to let you know when the script ran).

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
final_df = pd.concat([users_df, my_df])
final_df  # this is what main should return
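Putting the three steps together, the main function of such a script might look roughly like the sketch below. This is not the actual contents of "my_first_datascript.py". The FakeWe class is a made-up stand-in that returns one invented row so the function can be tried without a database connection; in a real run, Data Manager passes in a genuine we object.

```python
import pandas as pd


def main(we):
    """Return the full user list plus one new row (a sketch, not the real script)."""
    users_df = we.get_tbl_query('''
        SELECT
            name,
            email
        FROM
            dw.we_module_tutorial_users
    ''')
    my_df = pd.DataFrame({
        'name': ['Your Name'],
        'email': ['your.name@wework.com'],
    })
    # For a "normal" script, the returned dataframe replaces the existing table.
    return pd.concat([users_df, my_df], ignore_index=True)


class FakeWe:
    """Hypothetical stand-in for the we object; it ignores the query text."""

    def get_tbl_query(self, query):
        return pd.DataFrame({
            'name': ['Existing User'],
            'email': ['existing.user@wework.com'],
        })


result = main(FakeWe())
print(result)
```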

So that is what my_first_datascript is doing. Let's test it by simulating the process Data Manager would follow.

Note: This won't work unless you have write permission for the dw schema. If you don't know whether you have write permission, you almost certainly don't. If it doesn't work, just use your imagination.

In [None]:
from we_module.materialize import Materialize
help(Materialize)

In [None]:
# Materialize operates over the We Module connection, so pass it the we object.
mat = Materialize(we)
# Load the script.
mat.load_script('./my_first_datascript.py')
# Materialize it. This is what actually changes the database; it takes a little while.
mat.materialize_script(mat.config['schedule'][0])

When the script is complete, it will return a datetime that reflects the completion time.

Now you have materialized your script! If you query the we_module_tutorial_users table again you should see your name: 

In [None]:
we.get_tbl_query('''
        SELECT
            name,
            email
        FROM
            dw.we_module_tutorial_users
    ''')